System Info
8*A100 GPUs in a Docker environment
Information
🐛 Describe the bug
Training always aborts right after saving the checkpoint at the 249999th step. I presume the model-saving process on rank 0 somehow disrupts NCCL communication. According to the logs, the saving process takes nowhere near the NCCL timeout threshold (which should be 30 minutes by default). Any advice on how to resolve this issue would be helpful!
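One workaround I'm considering is raising the NCCL timeout when initializing the process group and fencing the rank-0 save with a barrier, roughly as in the sketch below (assuming a standard torch.distributed setup; the model object, step value, and path are placeholders, and the 60-minute value is arbitrary). I haven't verified whether this actually avoids the abort.

```python
# Sketch of a possible workaround (not verified): raise the NCCL timeout at
# process-group init and make all ranks wait on a barrier while rank 0 writes
# the checkpoint, so they don't hang on the next collective.
import datetime
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=60),  # default is 30 minutes
)

def save_checkpoint(model, step, path="checkpoint.pt"):  # placeholder names
    if dist.get_rank() == 0:
        torch.save({"step": step, "model": model.state_dict()}, path)
    # all ranks wait here until rank 0 finishes writing the file
    dist.barrier()
```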
Error logs
Expected behavior
NCCL timeout error after a certain number of training steps.