RuntimeError: NCCL communicator was aborted on rank 1

lilyswang commented 2 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

I have searched related issues but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug A clear and concise description of what the bug is.

Reproduction

What command or script did you run?

./run_detection_train.sh

Did you make any modifications on the code or config? Did you understand what you have modified? No.
What dataset did you use?

My own dataset (like bdd100k), about 112,000 images in the training set.

Thanks for your nice work. We have run into a problem and need your help. I started training with my own dataset. When training finishes one epoch, the following error is reported (see the attached log for details):

20220112_010819.log

We look forward to your reply! Thanks a lot!

mathmanu commented 2 years ago

I am not an expert in CUDA / NCCL. But please search a bit and see if you get a solution. For example, I think these threads may be useful:

https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out
https://discuss.pytorch.org/t/runtimeerror-nccl-communicator-was-aborted/136630/2
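For anyone hitting NCCL time-outs of this kind, two generic things are often worth trying first: turning on NCCL's own logging and giving the process group a longer collective timeout. The sketch below is only an illustration under assumptions, not a confirmed fix for this issue. The helper names `nccl_debug_env` and `process_group_kwargs` are hypothetical, and the 60-minute default is an arbitrary example value. `NCCL_DEBUG` is a standard NCCL environment variable, and `timeout` is a documented argument of `torch.distributed.init_process_group`.

```python
import os
from datetime import timedelta


def nccl_debug_env(base=None):
    """Return a copy of the environment with NCCL logging enabled.

    Hypothetical helper: an existing NCCL_DEBUG value is left untouched.
    """
    env = dict(os.environ if base is None else base)
    env.setdefault("NCCL_DEBUG", "INFO")  # makes NCCL print diagnostic logs
    return env


def process_group_kwargs(timeout_minutes=60):
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper: the default of 60 minutes is an assumed example,
    not a recommended setting.
    """
    return {
        "backend": "nccl",
        "timeout": timedelta(minutes=timeout_minutes),
    }


if __name__ == "__main__":
    # Export the variable before any NCCL communicator is created.
    os.environ.update(nccl_debug_env())
    print(process_group_kwargs(90))
    # In the training script, assuming PyTorch is installed:
    #   import torch.distributed as dist
    #   dist.init_process_group(**process_group_kwargs(90))
```

If the training is started through a launcher script, the same variable can be exported in the shell before the script runs. Whether a longer timeout helps depends on the actual cause, for example a slow evaluation or checkpoint step on one rank at the end of the epoch, which the attached log would have to confirm.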

malianghui commented 2 years ago

@lilyswang Hello, I have the same error. Have you solved the problem?

malianghui commented 2 years ago

@mathmanu I have tried the approaches in your links, but they do not work. So sad!

weiyx16 commented 2 years ago

Facing exactly the same problem...

TexasInstruments / edgeai-mmdetection

RuntimeError: NCCL communicator was aborted on rank 1 #8