X-LANCE / SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model
MIT License

NCCL error when saving with DDP #109

Open Vindicator645 opened 4 months ago

Vindicator645 commented 4 months ago

System Info

8×A100 GPUs in a Docker environment

Information

🐛 Describe the bug

Training always aborts after saving the checkpoint at step 249999. I presume the model-saving process on rank 0 somehow disrupts NCCL communication. According to the logs, saving takes nowhere near the NCCL timeout threshold (which should be 30 minutes by default). Any advice on how to resolve this issue would be helpful!
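For reference, here is a minimal sketch of a commonly used pattern for checkpointing under DDP: every rank reaches a barrier, only rank 0 writes the file, and all ranks synchronize again before the next collective. It is not taken from the SLAM-LLM code and is not a confirmed fix for this issue. The helper names `init_distributed` and `save_checkpoint_synchronized` are hypothetical, and the 30-minute timeout simply mirrors the value mentioned above.

```python
import datetime
import os

import torch
import torch.distributed as dist


def init_distributed(backend: str = "nccl", timeout_minutes: int = 30) -> None:
    """Create the default process group with an explicit collective timeout.

    Hypothetical helper. It relies on the usual env:// variables
    (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE), e.g. as set by torchrun.
    """
    dist.init_process_group(
        backend=backend,
        timeout=datetime.timedelta(minutes=timeout_minutes),
    )


def save_checkpoint_synchronized(model: torch.nn.Module, path: str) -> None:
    """Save on rank 0 only, with barriers so all ranks enter and leave together.

    Hypothetical helper, assuming the process group is already initialised.
    """
    # Every rank finishes its current step before rank 0 starts writing.
    dist.barrier()
    if dist.get_rank() == 0:
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        # Unwrap a DDP-wrapped model so keys carry no "module." prefix.
        to_save = model.module if hasattr(model, "module") else model
        torch.save(to_save.state_dict(), path)
    # The other ranks wait here rather than starting the next collective
    # while rank 0 is still busy with disk I/O.
    dist.barrier()
```

If a run still aborts with this pattern, setting the environment variable `NCCL_DEBUG=INFO` before launching makes NCCL log more detail about which collective failed.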

Error logs

image

Expected behavior

NCCL timeout error after a certain number of training steps

cnlinxi commented 4 months ago

Same problem. Do you have any solution?

zhangron013 commented 2 months ago

Same problem here. Do you have any solution?