facebookresearch / ViewDiff

ViewDiff generates high-quality, multi-view consistent images of a real-world 3D object in authentic surroundings (CVPR 2024).

ncclInvalidUsage error #24

Closed: xiyichen closed this issue 1 month ago

xiyichen commented 1 month ago

I tried to run train_small.sh and got an NCCL error, "ncclInvalidUsage", with "Duplicate GPU detected". It looks like the launcher requested 4 NCCL processes while I only have 1 GPU.

Here's part of the log:

09/04/2024 02:49:57 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:0

Mixed precision type: no

09/04/2024 02:49:57 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:0

Mixed precision type: no

09/04/2024 02:49:57 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:0

Mixed precision type: no

09/04/2024 02:49:58 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0

...

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.19.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.19.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 3 and rank 0 both on CUDA device 1000
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.19.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 1000
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.19.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000
[2024-09-04 02:39:29,793] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 493262 closing signal SIGTERM
[2024-09-04 02:39:29,794] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 493264 closing signal SIGTERM
[2024-09-04 02:39:29,794] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 493265 closing signal SIGTERM
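
From the log, all four ranks report Device: cuda:0, and NCCL requires each rank to own a distinct GPU. A minimal sketch of the per-process device assignment a distributed launcher performs (illustrative only, not the actual ViewDiff or accelerate code):

import os

import torch

# torchrun/accelerate export LOCAL_RANK per process; with 4 processes these are 0..3
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# each process is expected to claim its own GPU; with only 1 visible device,
# every local rank collapses onto cuda:0 and NCCL raises "Duplicate GPU detected"
torch.cuda.set_device(local_rank % max(torch.cuda.device_count(), 1))
print(f"rank {local_rank} -> cuda:{torch.cuda.current_device()}")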

How can I set the number of processes to 1? I tried CUDA_VISIBLE_DEVICES=0, but it didn't help.
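
For reference, my understanding is that the process count is determined by the launch command itself; CUDA_VISIBLE_DEVICES only hides GPUs, so with one visible device all four ranks still land on cuda:0. Assuming train_small.sh wraps a torchrun invocation (the torch.distributed.elastic warnings above suggest it does), I would expect passing --nproc_per_node=1 to torchrun to force a single process; if it goes through accelerate launch instead, the equivalent flag is --num_processes 1. Is that the intended way to run this on a single GPU?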