I'm training a model on multiple 3090 GPUs. It starts normally, but the program hangs at the end of an epoch: there is no more output, some GPUs sit at 100% utilization while others are at 0%, and the program eventually exits with an NCCL timeout. The following appeared at the end of the output:
It looks like some GPUs exited early, causing synchronization to fail.
There should be no problem with the code itself: it runs fine on the dataset before data augmentation. But when running on the augmented dataset, this problem occurs, similar to https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out.
I tried `export NCCL_P2P_DISABLE=1`, but it didn't help.
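One way I can reproduce the "some GPUs exit early" symptom on paper is when augmentation changes the dataset size so that ranks end up with different numbers of batches per epoch: the rank with an extra batch launches one more gradient all-reduce, which never completes because the other ranks have already left the loop. Below is a minimal stdlib-only sketch of that arithmetic, assuming samples are sharded contiguously across ranks *without* padding (`batches_per_rank` is a hypothetical helper; PyTorch's `DistributedSampler` pads to an even length by default, but a custom sampler or per-rank filtering during augmentation would not):

```python
import math

def batches_per_rank(num_samples, world_size, batch_size, drop_last=False):
    """Steps each rank runs per epoch when samples are split
    contiguously across ranks without padding (hypothetical sharding)."""
    counts = []
    base, rem = divmod(num_samples, world_size)
    for rank in range(world_size):
        n = base + (1 if rank < rem else 0)  # this rank's shard size
        steps = n // batch_size if drop_last else math.ceil(n / batch_size)
        counts.append(steps)
    return counts

# 4 GPUs, batch size 8.
# 1024 samples: every rank runs 32 steps, collectives stay matched.
print(batches_per_rank(1024, 4, 8))  # → [32, 32, 32, 32]
# After augmentation grows the dataset to 1025 samples, rank 0 runs
# one extra step and hangs in all-reduce waiting for the others.
print(batches_per_rank(1025, 4, 8))  # → [33, 32, 32, 32]
```

If this is the cause, printing `len(dataloader)` on each rank at epoch start should show the mismatch, and either padding the sampler or setting `drop_last=True` would make the step counts equal again.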