-
The code does not seem to support multi-GPU training. I found parts of the code that appear intended to support it, but it does not seem to work. Can you fix it?
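For illustration, here is a minimal sketch of what multi-GPU support via PyTorch `DistributedDataParallel` usually looks like; the model, data, and training loop below are placeholders, not the repository's actual code.

```python
# Minimal DDP sketch (placeholder model/data).
# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(10):                               # placeholder training loop
        x = torch.randn(32, 128, device="cuda")
        y = torch.randint(0, 10, (32,), device="cuda")
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                                  # DDP all-reduces gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```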
-
### Please describe your problem in detail
I'm trying to start a PyTorch training job using Volcano and its PyTorch plugin. I have 2 nodes, each with 8 GPUs.
I found that Volcano sets WORLD_SIZE = 2, RANK …
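For context, a minimal sketch of one workaround when a scheduler exports node-level values: recompute the global world size and rank before calling `init_process_group`. Treating WORLD_SIZE as the node count and RANK as the node index is an assumption based on the report above, and the `GPUS_PER_NODE`/`LOCAL_RANK` handling is illustrative rather than documented Volcano behavior.

```python
# Sketch: derive a global world size/rank when the scheduler only provides
# per-node values. MASTER_ADDR/MASTER_PORT are assumed to be set already.
import os
import torch
import torch.distributed as dist

def init_distributed(gpus_per_node: int = 8):
    num_nodes = int(os.environ.get("WORLD_SIZE", 1))   # assumed: node count from the scheduler
    node_rank = int(os.environ.get("RANK", 0))         # assumed: node index from the scheduler
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # assumed: set per process by the launcher

    global_world_size = num_nodes * gpus_per_node
    global_rank = node_rank * gpus_per_node + local_rank

    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        world_size=global_world_size,
        rank=global_rank,
    )
    return global_rank, global_world_size
```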
-
**How to customise train.sh for distributed Mamba training?**
Hello,
As I've seen in the Megatron modules, there isn't a pre-defined bash script to pre-train a Mamba model on multiple GPUs, ho…
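A hedged sketch of the launch pattern such a script would wrap, written in Python for consistency with the other sketches here: one `torchrun` invocation per node. The entry point `pretrain_mamba.py` and the Megatron-style model arguments are assumptions, not a verified recipe.

```python
# Sketch: build a per-node torchrun command for a Megatron-style Mamba
# pretraining run. Only the torchrun flags are standard; the script name
# and model arguments are assumed placeholders.
import os
import subprocess

NNODES = int(os.environ.get("NNODES", "2"))
NODE_RANK = int(os.environ.get("NODE_RANK", "0"))
GPUS_PER_NODE = int(os.environ.get("GPUS_PER_NODE", "8"))
MASTER_ADDR = os.environ.get("MASTER_ADDR", "localhost")
MASTER_PORT = os.environ.get("MASTER_PORT", "6000")

cmd = [
    "torchrun",
    f"--nproc_per_node={GPUS_PER_NODE}",
    f"--nnodes={NNODES}",
    f"--node_rank={NODE_RANK}",
    f"--master_addr={MASTER_ADDR}",
    f"--master_port={MASTER_PORT}",
    "pretrain_mamba.py",                    # assumed entry point; adjust to the actual script
    "--tensor-model-parallel-size", "1",    # assumed Megatron-style arguments
    "--micro-batch-size", "1",
]
subprocess.run(cmd, check=True)
```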
-
**Describe the bug**
While saving a Mamba-based model, the distributed optimizer reports an error during validation about `dt_bias`.
**To Reproduce**
Start training Mamba and run it for a few steps.
**Exp…
-
Does Q-Galore work with FSDP or DeepSpeed (DS)?
-
Hi, I have a script that runs with the DataParallel trainer on a machine with 8 H100 GPUs (an AWS p5 VM) with DeepSpeed. When we run the script, it randomly gets stuck forever at some iteration r…
-
This is to add the following topics to the HPC Handbook, including code examples with easy-to-set-up experiments (a minimal sketch follows the list):
- GPU Computing
- Distributed GPU Computing
- Distributed Training
- Distributed Inferenc…
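As one possible easy-to-set-up starter for the distributed topics, a minimal all-reduce hello-world sketch; the script name and launch command are placeholders.

```python
# Minimal distributed "hello world": each process puts its rank on a GPU
# tensor and all-reduces it, so every process ends up with the sum of ranks.
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")   # torchrun provides rank/world-size env vars
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    t = torch.tensor([float(rank)], device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # sum across all processes
    print(f"rank {rank}/{world_size}: all-reduced value = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```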
-
Hi,
I would like to test a program for distributed LLM training on an mi2508x system, and I want to use model parallelism to distribute parameters across GPUs. Is there a framework I should use to ac…
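One commonly used option, offered here only as an assumption about what might fit, is PyTorch FSDP, which shards parameters across GPUs; DeepSpeed and Megatron-LM are other frameworks that distribute parameters. A minimal sketch with a placeholder model:

```python
# Sketch: shard a toy model's parameters across GPUs with PyTorch FSDP.
# The model and sizes are placeholders, not an LLM training setup.
# Launch with: torchrun --nproc_per_node=<num_gpus> fsdp_demo.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(               # placeholder model
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda(local_rank)
    model = FSDP(model)                        # parameters are sharded across ranks

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).sum()
    loss.backward()                            # gradients reduced/sharded by FSDP

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```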
-
Hello! Thank you so much for your work.
I would like to ask whether removing distributed training has any effect on model training.
Thank you!
-