Open cqtanzj opened 2 years ago
I hit a similar error, please help.
I ran into this problem too.
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
It works when using 1 GPU.
This looks like an issue with DistributedDataParallel. Have you installed the PyTorch and CUDA versions specified in the readme?
I configured my environment exactly as described in the readme.txt file, but it still didn't work.
What's your graphics card and cuda version?
RTX 3090 and CUDA 11.4, and the error is: Could you give me some help? :)
The issue is that argparse isn't properly parsing the --gpu argument into a list. train_rcmvsnet.py:125 then sets the world size to the length of the string passed to --gpu (i.e. 5 when using --gpu [0,1]).
Just change train_rcmvsnet.py:68 to
parser.add_argument('--gpu',default=[0],help='gpu',nargs='+',type=int)
and pass the gpu args as --gpu 0 1 instead. That solved it for me anyway.
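To make the difference concrete, here is a minimal standalone sketch of the two parsing behaviours. The "broken" parser is only an assumption about what the original line 68 looks like (an argument with no nargs or type, so the value stays a raw string). The helper name build_parser and the *_world_size variables are made up for illustration and are not taken from train_rcmvsnet.py.

```python
import argparse


def build_parser(fixed: bool) -> argparse.ArgumentParser:
    """Build a tiny parser with only the --gpu option (illustrative helper)."""
    parser = argparse.ArgumentParser()
    if fixed:
        # The change suggested above: collect one or more ints into a list.
        parser.add_argument('--gpu', default=[0], help='gpu', nargs='+', type=int)
    else:
        # Assumption: the original argument keeps the value as a plain string.
        parser.add_argument('--gpu', default='[0]', help='gpu')
    return parser


# Old invocation style: --gpu [0,1] arrives as the 5-character string "[0,1]".
broken = build_parser(fixed=False).parse_args(['--gpu', '[0,1]'])
# New invocation style: --gpu 0 1 arrives as the list [0, 1].
fixed = build_parser(fixed=True).parse_args(['--gpu', '0', '1'])

# Taking len() of the parsed value mimics how the world size is derived.
broken_world_size = len(broken.gpu)
fixed_world_size = len(fixed.gpu)

print('string value:', repr(broken.gpu), '-> world size', broken_world_size)
print('list value:  ', fixed.gpu, '-> world size', fixed_world_size)
```

With the string version the script would ask for 5 processes on a 2-GPU machine, which would be consistent with the ncclInvalidUsage error above. With the list version the count matches the number of GPUs passed.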
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180594101/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3