ncclSystemError: System call (socket, malloc, munmap, etc) failed.

NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

ncclSystemError: System call (socket, malloc, munmap, etc) failed. #511

Open hamid-ramezani opened 3 years ago

hamid-ramezani commented 3 years ago

Hey,

I'm trying to run a machine learning task on two nodes (each with 4 RTX 3090 GPUs). I would like to set NCCL_MIN_NCHANNELS to a number greater than 4. When I train on a single node with multiple GPUs, everything works fine (I can set NCCL_MIN_NCHANNELS to 32 without problems). However, when I use two machines, I get an error as soon as I set NCCL_MIN_NCHANNELS to a number greater than 2.

My NCCL version is 2.8.4. I saw your recent commit; one of its changes is "Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels". Could this be related to the error I'm getting? If so, is there an environment variable in version 2.8.4 that I can use to work around the problem?

sjeaugey commented 3 years ago

It would be helpful if you could provide a trace for the error -- not sure what problem you're encountering in your experiments.

Trying with the latest NCCL would also be a good idea and save us time if it happens to be something that was fixed.
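For anyone landing here with the same error, here is a minimal sketch of one way to capture such a trace. It relies on the NCCL_DEBUG environment variable (for example WARN or INFO) and on the NCCL_MIN_NCHANNELS setting from the original report. The helper names `nccl_debug_env` and `run_with_trace`, the log file name, and the `train.py` command in the comment are hypothetical and only serve the illustration.

```python
import os
import subprocess


def nccl_debug_env(level="WARN", min_nchannels=None, base=None):
    """Return a copy of the environment with NCCL debug settings applied.

    `base` defaults to the current process environment; pass a dict to
    start from something else.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = level
    if min_nchannels is not None:
        env["NCCL_MIN_NCHANNELS"] = str(min_nchannels)
    return env


def run_with_trace(cmd, log_path, **env_options):
    """Run `cmd` with NCCL debug output enabled and save stdout and
    stderr to `log_path`. Returns the exit code of the command."""
    env = nccl_debug_env(**env_options)
    with open(log_path, "w") as log:
        result = subprocess.run(
            cmd, env=env, stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode


# Hypothetical usage, to be run on every node of the job:
# run_with_trace(["python", "train.py"], "nccl_trace.log",
#                level="INFO", min_nchannels=4)
```

The resulting log file from each node is what would typically be attached to a report like this one; the lines emitted by NCCL itself usually say which system call failed.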

ZetiMente commented 3 years ago

> It would be helpful if you could provide a trace for the error -- not sure what problem you're encountering in your experiments.
>
> Trying with the latest NCCL would also be a good idea and save us time if it happens to be something that was fixed.

Is it because I am running it on WSL 2?

sjeaugey commented 3 years ago

I'm not sure how this is related to the issue above.

Regardless, WSL2 is not supported by NCCL 2.7.6. You may want to try NCCL 2.10.

ZetiMente commented 3 years ago

@sjeaugey ah, it was the same ncclSystemError, but I guess that's just a generic error for it not working.

I upgraded to NCCL 2.10 following the NVIDIA documentation. However, my program still reports NCCL 2.7.8.

Sorry for the noob question, but where would I change the configuration so that the code picks up NCCL 2.10? I'm using DeepSpeed with PyTorch Lightning. I don't know whether they need to upgrade, whether I need to configure something in code, or whether this needs to be configured on my system/environment.

sjeaugey commented 3 years ago

Indeed, the NCCL_DEBUG=WARN message is what really matters.

As for upgrading NCCL, some frameworks link NCCL statically, so upgrading NCCL sometimes doesn't have any effect as you would still use the version embedded with the framework. Unfortunately I'm not very familiar with PyTorch Lightning. Perhaps they would be able to tell you how to use a newer version of NCCL.
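To see which NCCL version the framework itself reports, a small check along these lines may help. This is a sketch that assumes PyTorch is installed and built with NCCL support; it uses `torch.cuda.nccl.version()`, which to my knowledge returns a tuple in recent PyTorch releases and a single integer in older ones. The helper names and the integer decoding (major*1000 + minor*100 + patch up to NCCL 2.8, major*10000 + minor*100 + patch from 2.9 on) reflect my understanding of the NCCL version code and should be treated as assumptions.

```python
def format_nccl_version(version):
    """Turn the value reported by the framework into 'major.minor.patch'.

    Accepts either a tuple such as (2, 10, 3) or an integer version code
    such as 2708 (assumed older encoding) or 21003 (assumed newer encoding).
    """
    if isinstance(version, tuple):
        return ".".join(str(part) for part in version)
    code = int(version)
    # Assumption: codes of 10000 and above use the newer encoding.
    major_unit = 10000 if code >= 10000 else 1000
    major, rest = divmod(code, major_unit)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def bundled_nccl_version():
    """Return the NCCL version PyTorch reports, or None if PyTorch is
    not installed. May raise on builds without NCCL support."""
    try:
        import torch
    except ImportError:
        return None
    return format_nccl_version(torch.cuda.nccl.version())


# Hypothetical usage:
# print(bundled_nccl_version())
```

If the printed version stays the same after upgrading the system-wide NCCL packages, that would be consistent with the framework using its own embedded copy.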

ZetiMente commented 3 years ago

> Indeed, the NCCL_DEBUG=WARN message is what really matters.
>
> As for upgrading NCCL, some frameworks link NCCL statically, so upgrading NCCL sometimes doesn't have any effect as you would still use the version embedded with the framework. Unfortunately I'm not very familiar with PyTorch Lightning. Perhaps they would be able to tell you how to use a newer version of NCCL.

It seems that PyTorch Lightning just uses the PyTorch backend. Does that mean I have to wait for PyTorch to upgrade its NCCL version?

sjeaugey commented 3 years ago

I would put that question to the PyTorch project; they should know how to replace the NCCL version, or which PyTorch version to pick to get NCCL 2.10.