Open lilyswang opened 2 years ago
I am not an expert in CUDA / NCCL, but please search a bit and see if you can find a solution. For example, I think these threads may be useful:
https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out
https://discuss.pytorch.org/t/runtimeerror-nccl-communicator-was-aborted/136630/2
@lilyswang Hello, I have the same error. Have you solved the problem?
@mathmanu I tried the approach from your link, but it did not work. So sad!
Facing exactly the same problem...
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
A clear and concise description of what the bug is.
Reproduction
Did you make any modifications to the code or config? Did you understand what you have modified? No.
What dataset did you use?
My own dataset (similar to bdd100k), with about 112,000 images in the training set.
Thanks for your nice work. We have run into a problem and need your help. I started training on my own dataset. When training reaches the end of an epoch, the following error is reported (see the attached log for details):
20220112_010819.log
We look forward to your reply! Thanks a lot!