Closed by looninho 2 years ago
Hi,
thank you for sharing your work.
I'm trying to test DEKR but I'm running into an NCCL issue. When I run train.py, it fails with this error:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Could you give me some tips to overcome this?
Environment: CUDA: GPU:
System:
I ran into the same problem. Have you found a way to solve it? I'd appreciate it if you could let me know.
Hi @longpeace,
I solved the ncclSystemError issue by adding the --ipc=host flag to the docker command.
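For anyone landing here later, this is a minimal sketch of what such a command might look like. Only --ipc=host comes from this thread; the image name (my-dekr-image), the --gpus all flag, and running train.py without arguments are placeholders I'm assuming for illustration. The likely reason it helps is that a container's default shared-memory segment is small, and sharing the host IPC namespace removes that limit for PyTorch/NCCL processes.

```shell
# Hypothetical image name; replace with the image you actually built or pulled.
IMAGE="${IMAGE:-my-dekr-image}"

# --ipc=host is the fix reported in this thread.
# --gpus all is an assumption (needs Docker 19.03+ with the NVIDIA Container Toolkit).
DOCKER_ARGS="--rm --gpus all --ipc=host"

# Dry run: print the command. Remove the leading "echo" to actually launch it.
echo docker run $DOCKER_ARGS "$IMAGE" python train.py
```

If sharing the host IPC namespace is not acceptable on your machine, Docker's --shm-size option (for example --shm-size=8g) is another way to raise the shared-memory limit, though I have not confirmed it fixes this particular error.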
[SOLVED]