When I run the eval.py script with "--use_dist True", I get this error:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
I am using the Docker image "nvcr.io/nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04", since the one specified in the original Dockerfile is no longer available on Docker Hub.
Any suggestions about what the problem could be?
Thank you in advance
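
In case it helps with diagnosing this, below is a minimal sketch of how the run could be repeated with NCCL's own logging turned on, since "unhandled system error" alone does not say which system call failed. NCCL_DEBUG is a standard NCCL environment variable; the helper names (nccl_debug_env, run_eval_with_nccl_logs) are hypothetical, and the sketch assumes eval.py is in the current directory and takes no other required arguments.

```python
import os
import subprocess
import sys


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with NCCL debug logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # At INFO level NCCL logs its initialization steps and error details.
    env["NCCL_DEBUG"] = "INFO"
    return env


def run_eval_with_nccl_logs(extra_args=()):
    """Run eval.py in distributed mode with NCCL logging; return its exit code."""
    # Assumption: eval.py sits in the current working directory.
    cmd = [sys.executable, "eval.py", "--use_dist", "True", *extra_args]
    return subprocess.run(cmd, env=nccl_debug_env()).returncode


# Usage (inside the container):
#     exit_code = run_eval_with_nccl_logs()
```

The extra log lines printed before the RuntimeError should show where NCCL fails. Inside Docker, one cause I have seen mentioned for this kind of error is the container's small default shared memory, which can be raised with docker run options such as --shm-size or --ipc=host, but I have not confirmed that this applies here.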