AMDResearch / hpcfund

AMD HPC Research Fund Cloud
https://amdresearch.github.io/hpcfund/

Get help for distributed model training on MI250 #30

Open OswaldHe opened 1 week ago

OswaldHe commented 1 week ago

Hi,

I would like to test a program for distributed LLM training on the mi2508x nodes, and I want to use model parallelism to distribute parameters across GPUs. Is there a framework I should use to achieve that? I tried DeepSpeed (https://github.com/microsoft/DeepSpeed), but its ZeRO stage-3 actually increases memory consumption on all GPUs compared with ZeRO stage-2, which only distributes the optimizer state. Are there any resources, recommendations, or examples specifically for AMD GPUs?
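For context, here is roughly the setup I compared (a minimal sketch with a toy model and made-up hyperparameters, not my exact training script):

```python
# Sketch of the DeepSpeed ZeRO configs I compared. Stage 2 shards
# optimizer state and gradients; stage 3 additionally shards parameters.
# Launched with the DeepSpeed launcher, e.g. `deepspeed --num_gpus=8 train.py`.
import deepspeed
import torch.nn as nn

# Toy stand-in for the real model.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # switching this between 2 and 3 is the comparison above
        "overlap_comm": True,
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler).
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```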

Thank you, Zifan

tom-papatheodore commented 5 days ago

Hey Zifan-

I'm not personally very familiar with all the different options for model-parallel distributed training, but you might check the ROCm blogs since they have a lot of LLM examples on AMD GPUs: https://rocm.blogs.amd.com/blog/category/applications-models.html

-Tom

OswaldHe commented 5 days ago

Hi Tom,

Thank you. I found what I needed on the ROCm blog and will get back to you if that doesn't work.

Zifan

tom-papatheodore commented 4 days ago

Hey @OswaldHe-

I know you said you found what you needed, but I figured it wouldn't hurt to share this as well...

Here is a blog post that describes how AMD trained a small language model (SLM) on AMD GPUs w/ distributed training: https://www.amd.com/en/developer/resources/technical-articles/introducing-amd-first-slm-135m-model-fuels-ai-advancements.html

At the bottom, in the Call to Actions section, there is a link to the GitHub repo where you can reproduce the model yourself. I don't think you necessarily want to do that, but it should provide an example of using PyTorch FSDP for multi-node distributed training.
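For what it's worth, a bare-bones multi-node FSDP script has roughly this shape (the toy model and launch settings below are just placeholders on my part, not the code from that repo):

```python
# Minimal FSDP sketch. Launch with torchrun on each node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 fsdp_example.py
# (plus the usual rendezvous flags for multi-node runs)
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for us.
    dist.init_process_group(backend="nccl")  # maps to RCCL on ROCm
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for the real model.
    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

    # FSDP shards parameters, gradients, and optimizer state across all ranks,
    # similar in spirit to ZeRO stage-3.
    model = FSDP(model, device_id=local_rank)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One dummy training step.
    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```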

-Tom