Open hamid-ramezani opened 3 years ago
It would be helpful if you could provide a trace for the error -- not sure what problem you're encountering in your experiments.
Trying with the latest NCCL would also be a good idea and save us time if it happens to be something that was fixed.
Is it because I am running it on WSL 2?
I'm not sure how this is related to the issue above.
Regardless, WSL2 is not supported by NCCL 2.7.6. You may want to try NCCL 2.10.
@sjeaugey ah, it was the same ncclSystemError, but I guess that's just a generic error for it not working.
I upgraded to NCCL 2.10 following the NVIDIA documentation. However, my program still reports NCCL 2.7.8.
Sorry for the noob question, but where would I change the configuration so that the code picks up NCCL 2.10? I'm using DeepSpeed with PyTorch Lightning. I don't know whether they need to upgrade, whether I need to configure something in code, or whether this needs to be configured on my system/environment.
Indeed, the NCCL_DEBUG=WARN message is what really matters.
As for upgrading NCCL, some frameworks link NCCL statically, so upgrading NCCL sometimes doesn't have any effect as you would still use the version embedded with the framework. Unfortunately I'm not very familiar with PyTorch Lightning. Perhaps they would be able to tell you how to use a newer version of NCCL.
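One way to see which NCCL version the framework actually uses is to ask PyTorch directly. This is a minimal sketch, assuming a CUDA build of PyTorch: `torch.cuda.nccl.version()` returns a tuple in recent releases and, as far as I know, a packed integer in older ones. The helper names `describe_nccl_version` and `bundled_nccl_version` are made up for this example.

```python
def describe_nccl_version(raw):
    """Format the value returned by torch.cuda.nccl.version() as 'X.Y.Z'.

    Assumption: newer PyTorch returns a tuple such as (2, 10, 3); older
    releases return NCCL's packed version code (2708 for 2.7.8,
    21003 for 2.10.3).
    """
    if isinstance(raw, tuple):
        return ".".join(str(part) for part in raw[:3])
    # NCCL changed its packed encoding at 2.9: codes below 10000 use
    # major * 1000, later ones use major * 10000.
    if raw < 10000:
        major, rest = divmod(raw, 1000)
    else:
        major, rest = divmod(raw, 10000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def bundled_nccl_version():
    """Return the NCCL version PyTorch reports, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    try:
        return describe_nccl_version(torch.cuda.nccl.version())
    except Exception:
        # CPU-only builds or builds without NCCL may fail here.
        return None


if __name__ == "__main__":
    print("NCCL version seen by PyTorch:", bundled_nccl_version())
```

If this still prints 2.7.8 after a system-wide upgrade, that would be consistent with PyTorch using its own embedded copy of NCCL rather than the system library.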
It seems that PyTorch Lightning just uses the PyTorch backend. Does that mean I have to wait for PyTorch to upgrade its NCCL version?
I would ask that question to the PyTorch project; they should know how to replace the NCCL version, or which PyTorch version to pick to get NCCL 2.10.
Hey,
I'm trying to run a machine learning task on two nodes (each with 4 RTX 3090 GPUs). I would like to set NCCL_MIN_NCHANNELS to a number greater than 4. When I train on a single node with multiple GPUs, everything works fine (I can set NCCL_MIN_NCHANNELS to 32 and everything is OK). However, when I try to use two machines, I get an error when I set NCCL_MIN_NCHANNELS to a number greater than 2.
My NCCL version is 2.8.4. I saw your recent commit. One of the things you did is "Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels."
I was wondering if this is related to the error I'm getting? If so, is there any environment variable in version 2.8.4 that would let me solve my problem?
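For anyone trying to reproduce this, here is a rough sketch of how the run could be launched with different NCCL_MIN_NCHANNELS values while NCCL_DEBUG=WARN is set, so that the NCCL warning is captured next to each setting. The helper names `nccl_env` and `run_with_channels` and the `train.py` script are hypothetical. In a two-node job the same environment would have to be set on every node, which this single-process sketch does not handle.

```python
import os
import subprocess


def nccl_env(min_nchannels, debug="WARN", base=None):
    """Copy an environment and set the NCCL variables discussed above."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = debug  # WARN prints NCCL's explicit error message
    env["NCCL_MIN_NCHANNELS"] = str(min_nchannels)
    return env


def run_with_channels(cmd, min_nchannels):
    """Run cmd with the given channel count and capture its output."""
    return subprocess.run(
        cmd,
        env=nccl_env(min_nchannels),
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # "train.py" is a placeholder for the actual training entry point.
    for n in (2, 4, 8):
        result = run_with_channels(["python", "train.py"], n)
        print(f"NCCL_MIN_NCHANNELS={n} -> exit code {result.returncode}")
        print(result.stderr)
```

The first value at which the exit code becomes non-zero, together with the NCCL WARN lines from that run, is probably the most useful thing to attach to the issue.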