Open OswaldHe opened 1 month ago
Hey Zifan-
I'm not personally very familiar with all the different options for model-parallel distributed training, but you might check the ROCm blogs since they have a lot of LLM examples on AMD GPUs: https://rocm.blogs.amd.com/blog/category/applications-models.html
-Tom
Hi Tom,
Thank you. I found what I need on the ROCm blog and will get back to you if that doesn't work.
Zifan
Hey @OswaldHe-
I know you said you found what you needed, but I figured it wouldn't hurt to share this as well...
Here is a blog post that describes how AMD trained a small language model (SLM) on AMD GPUs with distributed training: https://www.amd.com/en/developer/resources/technical-articles/introducing-amd-first-slm-135m-model-fuels-ai-advancements.html
At the bottom, in the Call to Actions section, there is a link to the GitHub repository where you can reproduce the model yourself. I don't think you necessarily want to do that, but it should provide an example of using PyTorch FSDP for multi-node distributed training.
-Tom
Hi @OswaldHe, could you share how you got DeepSpeed working, please? I had DeepSpeed training running on another MI250 cluster, but here I am getting launcher errors... Thanks for the help!
Hi @Alexis-BX
I just installed huggingface accelerate and deepspeed in a conda environment: https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed.
Hi, thanks for the reply. I already have both installed. Which launcher do you use? I usually use the pdsh launcher, but it does not seem to be installed here. (Or if you just want to post your DeepSpeed command/config file, I can look through it. Thanks!)
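No config file was posted in this thread, so the following is only a hypothetical starting point for the kind of DeepSpeed config being asked about, not the one @OswaldHe used. It is a minimal Python sketch that builds a ZeRO stage 3 config as a dictionary and renders it as JSON. The batch size, accumulation steps, and the two `stage3_*` buffer sizes are illustrative values chosen for this example, not recommendations or library defaults.

```python
import json

# Hypothetical DeepSpeed config for ZeRO stage 3 (values are illustrative).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                      # partition optimizer state, gradients, and parameters
        "overlap_comm": True,            # overlap communication with computation
        "contiguous_gradients": True,    # copy gradients into a contiguous buffer
        # Assumed values: smaller buffers trade speed for lower peak memory.
        "stage3_max_live_parameters": 100_000_000,
        "stage3_prefetch_bucket_size": 50_000_000,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}


def render_config(config: dict) -> str:
    """Return the config as a JSON string that can be saved to a file."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(render_config(ds_config))
```

Saving the printed JSON to a file (for example `ds_config.json`, a name assumed here) gives something that can be referenced from the Accelerate DeepSpeed setup described at the link above. It does not address which launcher to use.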
Hi,
I would like to test a program for distributed LLM training on mi2508x, and I want to use model parallelism to distribute parameters across GPUs. Is there a framework I should use to achieve that? I tried DeepSpeed (https://github.com/microsoft/DeepSpeed), but its ZeRO stage 3 actually increases memory consumption on all GPUs compared with ZeRO stage 2, which only does optimizer distribution. Are there any resources, recommendations, or examples specifically for AMD GPUs?
Thank you, Zifan
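As a reference point for the stage 2 versus stage 3 comparison in the question, here is a rough sketch of the per-GPU memory for model states only, following the accounting in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. The 7B parameter count and 8 GPUs are assumptions for illustration. Activations, communication buffers, and the temporary parameter gathers that stage 3 performs are not modeled, and those may be worth checking if stage 3 uses more memory in practice.

```python
def zero_model_state_bytes(num_params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU bytes for model states under mixed-precision Adam.

    stage 0: nothing partitioned
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    weights = 2.0 * num_params   # fp16 parameters
    grads = 2.0 * num_params     # fp16 gradients
    optim = 12.0 * num_params    # fp32 weights copy + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        weights /= num_gpus
    return weights + grads + optim


if __name__ == "__main__":
    params = 7e9   # assumed model size, for illustration only
    gpus = 8       # assumed number of GPUs on one node
    for stage in range(4):
        gb = zero_model_state_bytes(params, gpus, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.2f} GB of model states per GPU")
```

Under this accounting stage 3 should hold fewer model-state bytes per GPU than stage 2, so an observed increase would have to come from something the formula leaves out.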