Multi-node, multi-GPU pretraining

lixin716 commented 6 months ago

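Can this be trained in a distributed setup across multiple nodes with multiple GPUs? I am using 4 nodes with two GPUs each, but only one node is working properly; the GPUs on the other nodes are idle.

Thanks!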

DLLXW commented 6 months ago

This should be supported out of the box. The following comes from llama2.c:

To run a small debug run on a single GPU, example:
$ python train.py --compile=False --eval_iters=10 --batch_size=8

To run with DDP on 4 GPUs on 1 node, example:
$ torchrun --standalone --nproc_per_node=4 train.py

To run with DDP across 2 nodes with 8 GPUs each (16 GPUs in total), example:

Run on the first (master) node with example IP 123.456.123.456:
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 train.py

Run on the worker node:
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 train.py
(If your cluster does not have an InfiniBand interconnect, prepend NCCL_IB_DISABLE=1.)
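
For the setup in the question (4 nodes with 2 GPUs each), the same pattern would presumably apply with --nnodes=4 and --nproc_per_node=2. torchrun has to be started on every node, each with its own --node_rank; launching it on only one node leaves the other nodes idle. The sketch below is a hypothetical dry-run helper, not part of this repo: it only prints the command for each node. The address 10.0.0.1 and port 29500 are placeholder assumptions, so substitute the real address of the rank-0 node and a free port.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints (does not execute) the torchrun
# command that each of the 4 nodes would need to run.
NNODES=4                                  # number of nodes, from the question
NPROC_PER_NODE=2                          # GPUs per node, from the question
MASTER_ADDR="${MASTER_ADDR:-10.0.0.1}"    # assumed placeholder: IP of the rank-0 node
MASTER_PORT="${MASTER_PORT:-29500}"       # assumed placeholder: a free port on that node

# Print the launch command for the node with the given rank (0..NNODES-1).
launch_cmd() {
  local node_rank="$1"
  echo "torchrun --nproc_per_node=${NPROC_PER_NODE} --nnodes=${NNODES} --node_rank=${node_rank} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} train.py"
}

# One line per node; run the matching line on the matching machine.
for r in 0 1 2 3; do
  launch_cmd "$r"
done
```

Every node must use the same --master_addr and --master_port; only --node_rank differs between machines.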

lixin716 commented 6 months ago

OK, thanks!

DLLXW / baby-llama2-chinese