NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

Distributed Mamba Training #944

Open SkanderBS2024 opened 1 month ago

SkanderBS2024 commented 1 month ago

How to customise train.sh for distributed Mamba training?

Hello, as I've seen in the Megatron modules, there isn't a pre-defined bash script to pre-train a Mamba model on multiple GPUs. How can I set it up for model / data parallelism?

deepakn94 commented 1 month ago

This runs training on 8 GPUs: https://github.com/NVIDIA/Megatron-LM/blob/ssm/examples/mamba/train.sh. You can extend this to multi-node by passing the appropriate arguments to torchrun (adapted from https://pytorch.org/docs/stable/elastic/run.html#usage):

torchrun \
    --nnodes=$NUM_NODES \
    --nproc-per-node=$NUM_TRAINERS \
    --rdzv-id=$JOB_ID \
    --rdzv-backend=c10d \
    --rdzv-endpoint=$HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
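
For instance, you would run the same command on every node, all pointing at one rendezvous endpoint. The following is only a sketch under a few assumptions: pretrain_mamba.py and its argument list are whatever examples/mamba/train.sh already passes (here collected into a placeholder MEGATRON_ARGS array), and the node count, GPU count, addresses, and port are illustrative:

    # Run this on every node; only the rendezvous endpoint must be shared.
    NUM_NODES=2          # illustrative: two nodes
    GPUS_PER_NODE=8      # illustrative: 8 GPUs per node
    MASTER_ADDR=node0    # hostname/IP of the rendezvous node, reachable from all nodes
    JOB_ID=12345         # any ID that is identical across nodes for this run

    torchrun \
        --nnodes=$NUM_NODES \
        --nproc-per-node=$GPUS_PER_NODE \
        --rdzv-id=$JOB_ID \
        --rdzv-backend=c10d \
        --rdzv-endpoint=$MASTER_ADDR:29500 \
        pretrain_mamba.py \
        "${MEGATRON_ARGS[@]}"   # placeholder for the existing argument list from train.sh, unchanged

Regarding model / data parallelism: Megatron-LM derives the data-parallel size from the world size divided by the tensor- and pipeline-parallel sizes (--tensor-model-parallel-size, --pipeline-model-parallel-size), so for pure data parallelism you typically just add nodes/GPUs and leave those two flags at their values from the example script.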
SkanderBS2024 commented 1 month ago

I meant multi-node*, i.e. the number of nodes (--nnodes) and the GPU count per node (--nproc-per-node). Thank you!

SkanderBS2024 commented 1 month ago

@deepakn94 is it possible to set up the GPUs dynamically during training? For example, I have a total of 180 GPUs: 90 of them are fixed for the whole training time, and the other 90 will not always be available, so I'd like a sort of switching that uses the extra GPUs whenever they are available.

deepakn94 commented 1 month ago

Maybe the elastic options here are useful? https://pytorch.org/docs/stable/elastic/run.html#elastic-min-1-max-4-tolerates-up-to-3-membership-changes-or-failures
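
Adapted from that section, an elastic launch lets the node count vary between a minimum and a maximum and tolerates a limited number of membership changes. The sketch below uses the structure of the PyTorch example; the node range, restart count, and variable names are placeholders to be adjusted to the 90/180-GPU setup:

    torchrun \
        --nnodes=$MIN_NODES:$MAX_NODES \
        --nproc-per-node=$NUM_TRAINERS \
        --max-restarts=3 \
        --rdzv-id=$JOB_ID \
        --rdzv-backend=c10d \
        --rdzv-endpoint=$HOST_NODE_ADDR \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

Note that each membership change triggers a restart with a different world size, so the training script has to resume from a checkpoint and cope with a changed data-parallel size; it's worth verifying that your Megatron-LM configuration tolerates that before relying on it.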

SkanderBS2024 commented 1 month ago

I'll take a look, thank you!