Boese0601 / RC-MVSNet

[ECCV 2022] RC-MVSNet: Unsupervised Multi-View Stereo with Neural Rendering
https://boese0601.github.io/rc-mvsnet/
MIT License

RuntimeError: NCCL error in #5

Open cqtanzj opened 2 years ago

cqtanzj commented 2 years ago

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180594101/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3

tech-fisher commented 1 year ago

I ran into a similar error. Please help.

bobfacer commented 1 year ago

I ran into this problem too:

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

bobfacer commented 1 year ago

It works when using a single GPU.

Boese0601 commented 1 year ago

This looks like an issue caused by DistributedDataParallel. Have you installed PyTorch and CUDA in the versions provided?

DongyangHuLi commented 1 year ago

I configured my environment exactly as described in the readme file, but it still didn't work.

Boese0601 commented 1 year ago

> I configured my environment exactly as the readme.txt file, but it still didn't work.

What are your graphics card and CUDA version?

DongyangHuLi commented 1 year ago

> > I configured my environment exactly as the readme.txt file, but it still didn't work.
>
> What's your graphics card and cuda version?

An RTX 3090 and CUDA 11.4 [image]. The error is: [image]. Could you give me some help? :)

alexrich021 commented 1 year ago

The issue is that argparse isn't parsing the --gpu argument into a list, so the value stays a string. train_rcmvsnet.py:125 then sets the world size to the length of the string passed to --gpu (e.g. 5 when using --gpu [0,1]).
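A minimal sketch of how that miscount can arise. The parser below is only a hypothetical stand-in: it assumes the original --gpu option was declared without nargs or type, which this thread does not show, and the names build_old_parser and world_size are illustrative rather than taken from train_rcmvsnet.py.

```python
import argparse


def build_old_parser():
    # Assumed shape of the original option: no nargs, no type.
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', default=[0], help='gpu')
    return parser


# Simulates running the script with: --gpu [0,1]
args = build_old_parser().parse_args(['--gpu', '[0,1]'])

# argparse keeps the value as the raw string '[0,1]',
# so len() counts characters rather than GPUs.
world_size = len(args.gpu)
print(repr(args.gpu), world_size)  # prints '[0,1]' 5
```

With a world size of 5 on a 2-GPU machine, the process group would be set up for more ranks than there are devices, which is consistent with the ncclInvalidUsage error reported above.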

Just change train_rcmvsnet.py:68 to

parser.add_argument('--gpu',default=[0],help='gpu',nargs='+',type=int)

and pass the gpu args as --gpu 0 1 instead. That solved it for me anyway.
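For anyone who wants to check the change before editing the training script, here is a small standalone sketch. Only the add_argument line comes from the comment above; the function name build_fixed_parser and the variables multi and single are made up for illustration.

```python
import argparse


def build_fixed_parser():
    parser = argparse.ArgumentParser()
    # nargs='+' collects one or more values into a list; type=int converts each one.
    parser.add_argument('--gpu', default=[0], help='gpu', nargs='+', type=int)
    return parser


fixed = build_fixed_parser()

# Simulates running the script with: --gpu 0 1
multi = fixed.parse_args(['--gpu', '0', '1'])
print(multi.gpu, len(multi.gpu))  # prints [0, 1] 2

# With no --gpu flag, the default list is used unchanged.
single = fixed.parse_args([])
print(single.gpu, len(single.gpu))  # prints [0] 1
```

The length of the parsed list now matches the number of GPUs requested, so a world size derived from len(args.gpu) would be 2 for --gpu 0 1.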