TexasInstruments / edgeai-mmdetection

This repository has been moved. The new location is in https://github.com/TexasInstruments/edgeai-tensorlab
https://github.com/TexasInstruments/edgeai

RuntimeError: NCCL communicator was aborted on rank 1 #8

Open lilyswang opened 2 years ago

lilyswang commented 2 years ago

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug A clear and concise description of what the bug is.

Reproduction

  1. What command or script did you run?
./run_detection_train.sh
  2. Did you make any modifications on the code or config? Did you understand what you have modified? No.

  3. What dataset did you use?

My own dataset (similar to bdd100k), with about 112,000 images in the training set.

Thanks for your nice work. We have run into a problem and need your help. I started training on my own dataset. When training reaches the end of an epoch, the following error is reported (see the attached log for details):

image

20220112_010819.log

We look forward to your reply. Thanks a lot!

mathmanu commented 2 years ago

I am not an expert in CUDA / NCCL, but please search a bit and see if you can find a solution. For example, I think these threads may be useful:

https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out

https://discuss.pytorch.org/t/runtimeerror-nccl-communicator-was-aborted/136630/2
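For anyone who wants a starting point: errors of this kind are often investigated by turning on NCCL logging and by giving collective operations a longer timeout. The snippet below is only a rough sketch of that idea, not a confirmed fix for this issue. `NCCL_DEBUG` is a real NCCL environment variable and `timeout` is a real argument of `torch.distributed.init_process_group`, but the helper names `nccl_debug_env` and `init_kwargs`, the two-hour value, and the place where these would be wired into this repository's training script are all assumptions.

```python
import os
from datetime import timedelta


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with NCCL logging switched on.

    Hypothetical helper: pass the result as `env=` when launching the
    training script (for example via subprocess) to get NCCL diagnostics.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # ask NCCL to print diagnostic messages
    return env


# Assumed value for illustration only; choose one that covers the
# slowest step between collectives in your own run.
COLLECTIVE_TIMEOUT = timedelta(hours=2)


def init_kwargs(timeout=COLLECTIVE_TIMEOUT):
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper: the caller would run
    torch.distributed.init_process_group(**init_kwargs())
    wherever the process group is created.
    """
    return {"backend": "nccl", "timeout": timeout}
```

Whether a longer timeout helps depends on why one rank stalls, so the NCCL log output is usually worth reading first.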

malianghui commented 2 years ago

@lilyswang Hello, I have the same error. Have you solved the problem?

malianghui commented 2 years ago

@mathmanu I have tried the approaches in your links, but they do not work. So sad!

weiyx16 commented 2 years ago

Facing exactly the same problem...