System Info
8*A100 GPUs in a Docker environment
Information
🐛 Describe the bug
Training always aborts right after saving the checkpoint at the 249999th step. I presume the model-saving process on rank 0 somehow disrupts NCCL communication. According to the logs, the saving process takes nowhere near the NCCL timeout threshold (which should be 30 minutes by default). Any advice on how to resolve this issue would be helpful!
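One workaround I'm considering is raising the NCCL timeout when initializing the process group and fencing the rank-0 save with a barrier, roughly as in the sketch below (assuming a standard torch.distributed setup; the model object, step value, and path are placeholders, and the 60-minute value is arbitrary). I haven't verified whether this actually avoids the abort.

```python
# Sketch of a possible workaround (not verified): raise the NCCL timeout at
# process-group init and make all ranks wait on a barrier while rank 0 writes
# the checkpoint, so they don't hang on the next collective.
import datetime
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=60),  # default is 30 minutes
)

def save_checkpoint(model, step, path="checkpoint.pt"):  # placeholder names
    if dist.get_rank() == 0:
        torch.save({"step": step, "model": model.state_dict()}, path)
    # all ranks wait here until rank 0 finishes writing the file
    dist.barrier()
```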
Error logs
Expected behavior
NCCL timeout error after a certain number of training steps.