I tried to run train_small.sh and got an NCCL error: "ncclInvalidUsage" with "Duplicate GPU detected". It looks like the launcher spawned 4 NCCL processes while I only have 1 GPU.
Here's part of the log:
09/04/2024 02:49:57 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:0
Mixed precision type: no
09/04/2024 02:49:57 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:0
Mixed precision type: no
09/04/2024 02:49:57 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:0
Mixed precision type: no
09/04/2024 02:49:58 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
...
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.19.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.19.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 3 and rank 0 both on CUDA device 1000
return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.19.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 1000
return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.19.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000
[2024-09-04 02:39:29,793] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 493262 closing signal SIGTERM
[2024-09-04 02:39:29,794] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 493264 closing signal SIGTERM
[2024-09-04 02:39:29,794] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 493265 closing signal SIGTERM
How can I set the number of processes to 1? I tried "CUDA_VISIBLE_DEVICES=0", but it didn't help; presumably the launcher still spawns 4 processes, and they all land on the same visible device, which would explain the "Duplicate GPU detected" errors.
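The log format looks like it comes from HF Accelerate, so my guess is that train_small.sh calls either torchrun or accelerate launch under the hood. Assuming that's the case, would editing the launch line to something like one of these force a single process? (train.py is a placeholder here; I don't know the actual entry point the script invokes.)

    # If the script launches with torchrun (assumption), pin it to one process:
    torchrun --nproc_per_node=1 train.py

    # If it launches with Accelerate instead (assumption, based on the log format):
    accelerate launch --num_processes 1 train.py

Or is there an environment variable or config file that overrides this without editing the script?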