NVIDIA / modulus

Open-source deep-learning framework for building, training, and fine-tuning models using state-of-the-art Physics-ML methods
https://developer.nvidia.com/modulus
Apache License 2.0

🐛[BUG]: Graphcast: Error when running mpirun --allow-run-as-root -np 3 for GraphCast model, but works with -np 2 #539

Closed. Flionay closed this issue 3 months ago.

Flionay commented 4 months ago

Version

0.5.0

On which installation method(s) does this occur?

Docker

Describe the issue

When I run the GraphCast model with mpirun --allow-run-as-root -np 3 python train_graphcast.py, it fails with an NCCL error while constructing DistributedDataParallel. The same command with -np 2 runs without any issues.

I am seeking help identifying the potential cause of this problem. The full log from the failing run is included under "Relevant log output" below.

Minimum reproducible example

mpirun --allow-run-as-root -np 3 python train_graphcast.py
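
The log below was collected with NCCL_DEBUG=INFO. A sketch of a full diagnostic invocation follows; NCCL_DEBUG_SUBSYS and the explicit CUDA_VISIBLE_DEVICES=2,3,4 (matching the nvmlDev 2/3/4 entries in the log) are illustrative assumptions, not the exact command used:

# NCCL_DEBUG=INFO enables verbose init logging; the SUBSYS filter keeps
# the output focused on communicator setup and topology detection.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH \
CUDA_VISIBLE_DEVICES=2,3,4 \
mpirun --allow-run-as-root -np 3 python train_graphcast.py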

Relevant log output

Cuda failure 1 'invalid argument'
Traceback (most recent call last):
  File "/graphcast/train_graphcast_2to1.py", line 523, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/graphcast/train_graphcast_2to1.py", line 340, in main
    trainer = GraphCastTrainer(cfg, dist, rank_zero_logger)
  File "/graphcast/train_graphcast_2to1.py", line 145, in __init__
    self.model = DistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 783, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 264, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1727, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 1 'invalid argument'
----------------------------------------------
59b88225e9fb:14085:14085 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.6<0>
59b88225e9fb:14085:14085 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
59b88225e9fb:14085:14085 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
59b88225e9fb:14085:14085 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
59b88225e9fb:14085:14085 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
59b88225e9fb:14085:14085 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.19.3+cuda12.3
59b88225e9fb:14087:14087 [2] NCCL INFO cudaDriverVersion 12030
59b88225e9fb:14087:14087 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.6<0>
59b88225e9fb:14087:14087 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
59b88225e9fb:14087:14087 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
59b88225e9fb:14087:14087 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
59b88225e9fb:14087:14087 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
59b88225e9fb:14085:14711 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
59b88225e9fb:14085:14711 [0] NCCL INFO P2P plugin IBext
59b88225e9fb:14085:14711 [0] NCCL INFO NET/IB : No device found.
59b88225e9fb:14085:14711 [0] NCCL INFO NET/IB : No device found.
59b88225e9fb:14085:14711 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.6<0>
59b88225e9fb:14085:14711 [0] NCCL INFO Using non-device net plugin version 0
59b88225e9fb:14085:14711 [0] NCCL INFO Using network Socket
59b88225e9fb:14087:14712 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
59b88225e9fb:14087:14712 [2] NCCL INFO P2P plugin IBext
59b88225e9fb:14087:14712 [2] NCCL INFO NET/IB : No device found.
59b88225e9fb:14087:14712 [2] NCCL INFO NET/IB : No device found.
59b88225e9fb:14087:14712 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.6<0>
59b88225e9fb:14087:14712 [2] NCCL INFO Using non-device net plugin version 0
59b88225e9fb:14087:14712 [2] NCCL INFO Using network Socket
59b88225e9fb:14086:14086 [1] NCCL INFO cudaDriverVersion 12030
59b88225e9fb:14086:14086 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.6<0>
59b88225e9fb:14086:14086 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
59b88225e9fb:14086:14086 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
59b88225e9fb:14086:14086 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
59b88225e9fb:14086:14086 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
59b88225e9fb:14086:14713 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
59b88225e9fb:14086:14713 [1] NCCL INFO P2P plugin IBext
59b88225e9fb:14086:14713 [1] NCCL INFO NET/IB : No device found.
59b88225e9fb:14086:14713 [1] NCCL INFO NET/IB : No device found.
59b88225e9fb:14086:14713 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.6<0>
59b88225e9fb:14086:14713 [1] NCCL INFO Using non-device net plugin version 0
59b88225e9fb:14086:14713 [1] NCCL INFO Using network Socket
59b88225e9fb:14086:14713 [1] NCCL INFO comm 0x5574f22e0f80 rank 1 nranks 3 cudaDev 1 nvmlDev 3 busId 66000 commId 0xb80ced506beebbca - Init START
59b88225e9fb:14085:14711 [0] NCCL INFO comm 0x5613bb376e20 rank 0 nranks 3 cudaDev 0 nvmlDev 2 busId 3f000 commId 0xb80ced506beebbca - Init START
59b88225e9fb:14087:14712 [2] NCCL INFO comm 0x55eb79b74d90 rank 2 nranks 3 cudaDev 2 nvmlDev 4 busId 9b000 commId 0xb80ced506beebbca - Init START
59b88225e9fb:14086:14713 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
59b88225e9fb:14086:14713 [1] NCCL INFO NVLS multicast support is available on dev 1
59b88225e9fb:14085:14711 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
59b88225e9fb:14087:14712 [2] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000
59b88225e9fb:14087:14712 [2] NCCL INFO NVLS multicast support is available on dev 2
59b88225e9fb:14085:14711 [0] NCCL INFO NVLS multicast support is available on dev 0
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 00/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 01/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 02/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 03/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 04/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 05/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 06/24 :    0   1   2
59b88225e9fb:14087:14712 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1 [4] -1/-1/-1->2->1 [5] -1/-1/-1->2->1 [6] 0/-1/-1->2->1 [7] 0/-1/-1->2->1 [8] 0/-1/-1->2->1 [9] 0/-1/-1->2->-1 [10] 0/-1/-1->2->-1 [11] 0/-1/-1->2->-1 [12] -1/-1/-1->2->1 [13] -1/-1/-1->2->1 [14] -1/-1/-1->2->1 [15] -1/-1/-1->2->1 [16] -1/-1/-1->2->1 [17] -1/-1/-1->2->1 [18] 0/-1/-1->2->1 [19] 0/-1/-1->2->1 [20] 0/-1/-1->2->1 [21] 0/-1/-1->2->-1 [22] 0/-1/-1->2->-1 [23] 0/-1/-1->2->-1
59b88225e9fb:14087:14712 [2] NCCL INFO P2P Chunksize set to 524288
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 07/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 08/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 09/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 10/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 11/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 12/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 13/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 14/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 15/24 :    0   1   2
59b88225e9fb:14086:14713 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->-1 [7] 2/-1/-1->1->-1 [8] 2/-1/-1->1->-1 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->-1 [19] 2/-1/-1->1->-1 [20] 2/-1/-1->1->-1 [21] -1/-1/-1->1->0 [22] -1/-1/-1->1->0 [23] -1/-1/-1->1->0
59b88225e9fb:14086:14713 [1] NCCL INFO P2P Chunksize set to 524288
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 16/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 17/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 18/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 19/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 20/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 21/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 22/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 23/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->2 [7] -1/-1/-1->0->2 [8] -1/-1/-1->0->2 [9] 1/-1/-1->0->2 [10] 1/-1/-1->0->2 [11] 1/-1/-1->0->2 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->2 [19] -1/-1/-1->0->2 [20] -1/-1/-1->0->2 [21] 1/-1/-1->0->2 [22] 1/-1/-1->0->2 [23] 1/-1/-1->0->2
59b88225e9fb:14085:14711 [0] NCCL INFO P2P Chunksize set to 524288
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 00/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 01/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 02/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 03/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 04/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 05/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 06/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 07/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 08/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 09/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 10/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 11/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 12/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 13/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 14/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 15/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 16/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 02/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 17/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 03/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 18/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 04/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 19/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 05/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 20/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 06/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 21/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 07/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 22/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 08/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 23/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 09/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 10/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 11/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 12/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 13/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 14/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 15/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 16/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 17/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 18/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 19/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 20/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 21/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 22/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 23/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 00/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 01/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 02/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 03/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 04/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 05/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 06/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 07/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 08/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 09/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 10/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 11/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 12/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 13/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 14/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 15/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 16/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 17/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 18/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 19/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 20/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 21/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 22/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 23/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Connected all rings
59b88225e9fb:14085:14711 [0] NCCL INFO Connected all rings
59b88225e9fb:14087:14712 [2] NCCL INFO Connected all rings
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 06/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 07/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 08/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 09/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 10/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 11/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 18/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 19/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 20/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 21/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 22/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 23/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 00/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 01/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 02/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 03/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 04/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 05/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 06/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 07/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 08/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 12/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 13/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 14/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 15/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 16/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 17/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 18/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 19/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 20/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 02/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 03/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 04/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 05/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 09/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 10/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 11/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 12/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 13/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 14/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 15/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 16/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 17/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 21/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 22/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 23/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Connected all trees
59b88225e9fb:14087:14712 [2] NCCL INFO Connected all trees
59b88225e9fb:14085:14711 [0] NCCL INFO Connected all trees
59b88225e9fb:14086:14713 [1] NCCL INFO NVLS comm 0x5574f22e0f80 headRank 1 nHeads 3 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 603979776
59b88225e9fb:14087:14712 [2] NCCL INFO NVLS comm 0x55eb79b74d90 headRank 2 nHeads 3 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 603979776
59b88225e9fb:14085:14711 [0] NCCL INFO NVLS comm 0x5613bb376e20 headRank 0 nHeads 3 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 603979776

59b88225e9fb:14086:14713 [1] transport/nvls.cc:169 NCCL WARN Cuda failure 1 'invalid argument'
59b88225e9fb:14086:14713 [1] NCCL INFO transport/nvls.cc:339 -> 1
59b88225e9fb:14086:14713 [1] NCCL INFO init.cc:1131 -> 1

59b88225e9fb:14085:14711 [0] transport/nvls.cc:169 NCCL WARN Cuda failure 1 'invalid argument'
59b88225e9fb:14085:14711 [0] NCCL INFO transport/nvls.cc:339 -> 1
59b88225e9fb:14085:14711 [0] NCCL INFO init.cc:1131 -> 1
59b88225e9fb:14086:14713 [1] NCCL INFO init.cc:1396 -> 1
59b88225e9fb:14086:14713 [1] NCCL INFO group.cc:64 -> 1 [Async thread]
59b88225e9fb:14085:14711 [0] NCCL INFO init.cc:1396 -> 1
59b88225e9fb:14085:14711 [0] NCCL INFO group.cc:64 -> 1 [Async thread]

59b88225e9fb:14087:14712 [2] transport/nvls.cc:169 NCCL WARN Cuda failure 1 'invalid argument'
59b88225e9fb:14087:14712 [2] NCCL INFO transport/nvls.cc:339 -> 1
59b88225e9fb:14087:14712 [2] NCCL INFO init.cc:1131 -> 1
59b88225e9fb:14087:14712 [2] NCCL INFO init.cc:1396 -> 1
59b88225e9fb:14087:14712 [2] NCCL INFO group.cc:64 -> 1 [Async thread]
59b88225e9fb:14087:14087 [2] NCCL INFO group.cc:418 -> 1
59b88225e9fb:14087:14087 [2] NCCL INFO group.cc:95 -> 1
59b88225e9fb:14086:14086 [1] NCCL INFO group.cc:418 -> 1
59b88225e9fb:14086:14086 [1] NCCL INFO group.cc:95 -> 1
59b88225e9fb:14085:14085 [0] NCCL INFO group.cc:418 -> 1
59b88225e9fb:14085:14085 [0] NCCL INFO group.cc:95 -> 1
Error executing job with overrides: []
Traceback (most recent call last):
  File "/graphcast/train_graphcast_2to1.py", line 523, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/graphcast/train_graphcast_2to1.py", line 340, in main
    trainer = GraphCastTrainer(cfg, dist, rank_zero_logger)
  File "/graphcast/train_graphcast_2to1.py", line 145, in __init__
    self.model = DistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 783, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 264, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1727, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 1 'invalid argument'

Environment details

No response
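
As an isolation step, a minimal sketch assuming NVIDIA's nccl-tests are built inside the container (the binary path and flags follow the nccl-tests README) exercises the same three-rank NCCL communicator setup, including the NVLS path, without involving any GraphCast code:

# -b/-e set the message-size sweep, -f the size growth factor,
# -g the number of GPUs per process (one per MPI rank here).
CUDA_VISIBLE_DEVICES=2,3,4 mpirun --allow-run-as-root -np 3 \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1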

Flionay commented 3 months ago

I wanted to follow up on this issue. Upon further investigation, I realized that the problem was not with the project code but with my local environment. Therefore, I am closing this issue.

For anyone encountering similar issues, the cause and solution for this environment problem are discussed here: NVIDIA/nccl#976.
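
Since the NCCL WARN lines in the log fail in transport/nvls.cc (the NVLS multicast setup), one commonly suggested mitigation for NVLS setup failures is to disable that transport. NCCL_NVLS_ENABLE is a standard NCCL environment variable; whether it resolves a given case depends on the underlying environment problem, so treat this as a sketch rather than the fix adopted here:

# Disables the NVLink SHARP (NVLS) transport whose setup fails in the log.
NCCL_NVLS_ENABLE=0 mpirun --allow-run-as-root -np 3 python train_graphcast.py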

Thank you for your time and support.