AMDResearch / hpcfund

AMD HPC Research Fund Cloud
https://amdresearch.github.io/hpcfund/

Get help for distributed model training on MI250 #30

Open OswaldHe opened 1 week ago

OswaldHe commented 1 week ago

Hi,

I would like to test a program for distributed LLM training on the mi2508x nodes, and I want to use model parallelism to distribute parameters across GPUs. Is there a framework I should use to achieve that? I tried DeepSpeed (https://github.com/microsoft/DeepSpeed), but its ZeRO stage-3 actually increases memory consumption on all GPUs compared with ZeRO stage-2, which only distributes the optimizer state. Are there any resources, recommendations, or examples specifically for AMD GPUs?
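For context, here is roughly the setup I compared (a minimal sketch with a toy model and made-up hyperparameters, not my exact training script):

```python
# Sketch of the DeepSpeed ZeRO configs I compared. Stage 2 shards
# optimizer state and gradients; stage 3 additionally shards parameters.
# Launched with the DeepSpeed launcher, e.g. `deepspeed --num_gpus=8 train.py`.
import deepspeed
import torch.nn as nn

# Toy stand-in for the real model.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # switching this between 2 and 3 is the comparison above
        "overlap_comm": True,
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler).
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```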

Thank you, Zifan

tom-papatheodore commented 5 days ago

Hey Zifan-

I'm not personally very familiar with all the different options for model-parallel distributed training, but you might check the ROCm blogs since they have a lot of LLM examples on AMD GPUs: https://rocm.blogs.amd.com/blog/category/applications-models.html

-Tom

OswaldHe commented 5 days ago

Hi Tom,

Thank you. I found what I needed on the ROCm blog and will get back to you if that doesn't work.

Zifan

tom-papatheodore commented 4 days ago

Hey @OswaldHe-

I know you said you found what you needed, but I figured it wouldn't hurt to share this as well...

Here is a blog post that describes how AMD trained a small language model (SLM) on AMD GPUs w/ distributed training: https://www.amd.com/en/developer/resources/technical-articles/introducing-amd-first-slm-135m-model-fuels-ai-advancements.html

At the bottom, in the Call to Actions section, there is a link to the GitHub repo where you can reproduce the model yourself. I don't think you necessarily want to do that, but it should provide an example of using PyTorch FSDP for multi-node distributed training.
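For what it's worth, a bare-bones multi-node FSDP script has roughly this shape (the toy model and launch settings below are just placeholders on my part, not the code from that repo):

```python
# Minimal FSDP sketch. Launch with torchrun on each node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 fsdp_example.py
# (plus the usual rendezvous flags for multi-node runs)
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for us.
    dist.init_process_group(backend="nccl")  # maps to RCCL on ROCm
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for the real model.
    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

    # FSDP shards parameters, gradients, and optimizer state across all ranks,
    # similar in spirit to ZeRO stage-3.
    model = FSDP(model, device_id=local_rank)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One dummy training step.
    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```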

-Tom