Open udhavsethi opened 1 year ago
It doesn't look like NCCL_DEBUG=INFO was taken into account. We should see NCCL INFO or NCCL WARN messages before the error happens.
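One way to confirm whether the variable actually reaches each rank is to print it at the very top of the training script. This is only a minimal sketch: the helper name `nccl_debug_report` is made up for illustration, and it assumes the launcher exports a `RANK` variable to each process (it falls back to `?` if not).

```python
import os


def nccl_debug_report(env=None):
    """Return one line describing the NCCL_DEBUG setting visible to this process."""
    env = os.environ if env is None else env
    # Assumption: the launcher exports RANK; otherwise "?" is shown.
    rank = env.get("RANK", "?")
    level = env.get("NCCL_DEBUG")
    if level is None:
        return f"[rank {rank}] NCCL_DEBUG is not set in this process"
    return f"[rank {rank}] NCCL_DEBUG={level}"


if __name__ == "__main__":
    # Call this before any distributed initialization so the line shows up
    # ahead of the first NCCL message in the log.
    print(nccl_debug_report(), flush=True)
```

If a rank reports that the variable is not set, the flag was probably exported in a shell that the launcher does not propagate to the worker processes.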
My bad, please see the attached log for the whole output: log.txt
I could not find anything clear. It looks like some of the ranks are failing (for example, crashing outside of NCCL), which causes NCCL to fail to connect to those ranks and report an error. The error reported is InternalError, which I believe is wrong. The code returning that error has changed in recent versions, so perhaps we should not return that error code: it is not an internal error, just a classic remote error where the other side is not responding.
Could you run with a more recent NCCL? Also could you check how each rank exits? Some ranks do error out in NCCL (as they could not connect to other ranks) but maybe some other ranks are exiting differently due to some other reason.
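As a rough aid for checking how each rank exits, the sketch below turns a process return code into a readable description. It relies on the POSIX convention used by Python's `subprocess` module, where a negative return code `-N` means the child was terminated by signal `N`. The function name `describe_exit` and the `exit_codes` mapping are hypothetical; how you collect per-rank return codes depends on your launcher.

```python
import signal


def describe_exit(returncode):
    """Describe a process return code (subprocess convention on POSIX)."""
    if returncode == 0:
        return "exited cleanly"
    if returncode < 0:
        # Negative values mean the process was terminated by a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Hypothetical rank -> return code mapping, for illustration only.
    exit_codes = {0: 1, 1: -11, 2: 0}
    for rank, code in sorted(exit_codes.items()):
        print(f"rank {rank}: {describe_exit(code)}")
```

A rank killed by a signal such as SIGSEGV or SIGKILL would point to a crash or an out-of-memory kill outside of NCCL, while a plain non-zero status on the remaining ranks would be consistent with them failing to connect afterwards.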
I am trying to run a training script using DeepSpeed on eight 32GB V100 GPUs.
For debugging, I enabled the following flags:
I am running into the following errors:
Here is my nvcc version:
and nccl version:
Here is the dumped xml file: topo.xml.txt
Please let me know if I can provide any other information to identify the source of this issue. I would highly appreciate any help or guidance on how to make this work.