Closed (carbonscott closed this issue 3 weeks ago)
Use srun as the launcher: srun --ntasks-per-node=4 ... python train.py
Refer to page 22 at https://docs.google.com/presentation/d/1FB2vqlibSWECRsCOFK2tMr_jT_PVyM06nAgmIR5qzhE/edit#slide=id.g29a556e7c6f_1_67
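When srun is the launcher, each task gets its rank from Slurm rather than from torchrun. Below is a minimal sketch of how train.py could translate Slurm's per-task variables into the names that torch.distributed's environment-variable initialization reads. It assumes train.py uses that initialization path; the helper name slurm_dist_env is hypothetical and not taken from the slides.

```python
import os


def slurm_dist_env(environ=os.environ):
    """Map Slurm per-task variables to RANK / WORLD_SIZE / LOCAL_RANK.

    Hypothetical helper for illustration. Falls back to a single-process
    layout when the Slurm variables are absent (e.g. a plain `python` run).
    """
    return {
        # Global rank of this task across all nodes.
        "RANK": str(int(environ.get("SLURM_PROCID", 0))),
        # Total number of tasks started by srun.
        "WORLD_SIZE": str(int(environ.get("SLURM_NTASKS", 1))),
        # Rank of this task on its own node; typically used to pick the GPU.
        "LOCAL_RANK": str(int(environ.get("SLURM_LOCALID", 0))),
    }


if __name__ == "__main__":
    # Export the values before the training code initializes its process group.
    os.environ.update(slurm_dist_env())
    print({k: os.environ[k] for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK")})
```

MASTER_ADDR and MASTER_PORT are not covered here; they would still need to be set, for example in the batch script before the srun line.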
NCCL issues were also reported at https://github.com/NVIDIA/nccl/issues/1024 and https://github.com/hiyouga/LLaMA-Factory/issues/1169.
Just use PyTorch 2.0.1 for now.