NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL INFO NET/IB : No device found #952

Open sedrick-keh-tri opened 1 year ago

sedrick-keh-tri commented 1 year ago

(I raised a similar issue in the Megatron repo, but I think it is more appropriate here, so I'm adding more details.)

I am trying to run Megatron multi-node inside Docker.

I have Docker set up on both nodes. Specifically, I run the following on each node:

docker run --gpus all --shm-size=1g --ipc=host --network=host --env NCCL_DEBUG=INFO -it --rm -v /home/${USER}:/workspace nvcr.io/nvidia/pytorch:23.06-py3

On Node 1, I run examples/pretrain_gpt_distributed_with_mp.sh (from the Megatron repo) with the following distributed settings:

MASTER_ADDR="<IP_address_of_node_1>"
MASTER_PORT=6000
NNODES=2
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

On Node 2, I run the same script with basically the same settings, except for NODE_RANK:

MASTER_ADDR="<IP_address_of_node_1>"
MASTER_PORT=6000
NNODES=2
NODE_RANK=1
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
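
(For reference, the example script feeds these variables to the launcher roughly as sketched below; this is only approximate, and the full argument list is in pretrain_gpt_distributed_with_mp.sh itself.)

GPUS_PER_NODE=8   # 8 GPUs per node, matching the log below
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
                  --nnodes $NNODES \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py <model and data args from the script>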

The scripts above work fine outside of Docker, but when I run them inside Docker, I get an error saying NCCL INFO NET/IB : No device found and the following traceback:

Traceback (most recent call last):
  File "/workspace/Megatron-LM/pretrain_gpt.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/workspace/Megatron-LM/megatron/training.py", line 90, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/Megatron-LM/megatron/initialize.py", line 86, in initialize_megatron
    _compile_dependencies()
  File "/workspace/Megatron-LM/megatron/initialize.py", line 150, in _compile_dependencies
    torch.distributed.barrier()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3646, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1182, internal error - please report this issue to the NCCL developers, NCCL version 2.18.3
ncclInternalError: Internal check failed.
Last error:
Socket recv failed while polling for opId=0x7ffea68e01c0

Would appreciate any thoughts on how to fix this. Thank you!

sedrick-keh-tri commented 1 year ago

If it helps, the full error message is below:

<cluster_name>:924:924 [5] NCCL INFO cudaDriverVersion 12010
<cluster_name>:921:921 [2] NCCL INFO cudaDriverVersion 12010
<cluster_name>:920:920 [1] NCCL INFO cudaDriverVersion 12010
<cluster_name>:922:922 [3] NCCL INFO cudaDriverVersion 12010
<cluster_name>:925:925 [6] NCCL INFO cudaDriverVersion 12010
<cluster_name>:919:919 [0] NCCL INFO cudaDriverVersion 12010
<cluster_name>:923:923 [4] NCCL INFO cudaDriverVersion 12010
<cluster_name>:926:926 [7] NCCL INFO cudaDriverVersion 12010
<cluster_name>:925:925 [6] NCCL INFO Bootstrap : Using ibp12s0:10.149.0.32<0>
<cluster_name>:921:921 [2] NCCL INFO Bootstrap : Using ibp12s0:10.149.0.32<0>
<cluster_name>:923:923 [4] NCCL INFO Bootstrap : Using ibp12s0:10.149.0.32<0>
<cluster_name>:924:924 [5] NCCL INFO Bootstrap : Using ibp12s0:10.149.0.32<0>
<cluster_name>:920:920 [1] NCCL INFO Bootstrap : Using ibp12s0:10.149.0.32<0>
<cluster_name>:922:922 [3] NCCL INFO Bootstrap : Using ibp12s0:10.149.0.32<0>
<cluster_name>:919:919 [0] NCCL INFO Bootstrap : Using ibp12s0:10.149.0.32<0>
<cluster_name>:926:926 [7] NCCL INFO Bootstrap : Using ibp12s0:10.149.0.32<0>
<cluster_name>:924:1152 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
<cluster_name>:924:1152 [5] NCCL INFO P2P plugin IBext
<cluster_name>:920:1145 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
<cluster_name>:920:1145 [1] NCCL INFO P2P plugin IBext
<cluster_name>:921:1146 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
<cluster_name>:921:1146 [2] NCCL INFO P2P plugin IBext
<cluster_name>:926:1151 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
<cluster_name>:926:1151 [7] NCCL INFO P2P plugin IBext
<cluster_name>:925:1148 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
<cluster_name>:925:1148 [6] NCCL INFO P2P plugin IBext
<cluster_name>:919:1150 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
<cluster_name>:919:1150 [0] NCCL INFO P2P plugin IBext
<cluster_name>:922:1149 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
<cluster_name>:922:1149 [3] NCCL INFO P2P plugin IBext
<cluster_name>:923:1147 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
<cluster_name>:923:1147 [4] NCCL INFO P2P plugin IBext
<cluster_name>:924:1152 [5] NCCL INFO NET/IB : No device found.
<cluster_name>:920:1145 [1] NCCL INFO NET/IB : No device found.
<cluster_name>:921:1146 [2] NCCL INFO NET/IB : No device found.
<cluster_name>:926:1151 [7] NCCL INFO NET/IB : No device found.
<cluster_name>:925:1148 [6] NCCL INFO NET/IB : No device found.
<cluster_name>:919:1150 [0] NCCL INFO NET/IB : No device found.
<cluster_name>:922:1149 [3] NCCL INFO NET/IB : No device found.
<cluster_name>:923:1147 [4] NCCL INFO NET/IB : No device found.
<cluster_name>:924:1152 [5] NCCL INFO NET/IB : No device found.
<cluster_name>:920:1145 [1] NCCL INFO NET/IB : No device found.
<cluster_name>:921:1146 [2] NCCL INFO NET/IB : No device found.
<cluster_name>:926:1151 [7] NCCL INFO NET/IB : No device found.
<cluster_name>:925:1148 [6] NCCL INFO NET/IB : No device found.
<cluster_name>:922:1149 [3] NCCL INFO NET/IB : No device found.
<cluster_name>:919:1150 [0] NCCL INFO NET/IB : No device found.
<cluster_name>:923:1147 [4] NCCL INFO NET/IB : No device found.
<cluster_name>:920:1145 [1] NCCL INFO NET/Socket : Using [0]ibp12s0:10.149.0.32<0> [1]ibp75s0:10.149.1.32<0> [2]ibp141s0:10.149.2.32<0> [3]ibp186s0:10.149.3.32<0>
<cluster_name>:920:1145 [1] NCCL INFO Using network Socket
<cluster_name>:924:1152 [5] NCCL INFO NET/Socket : Using [0]ibp12s0:10.149.0.32<0> [1]ibp75s0:10.149.1.32<0> [2]ibp141s0:10.149.2.32<0> [3]ibp186s0:10.149.3.32<0>
<cluster_name>:924:1152 [5] NCCL INFO Using network Socket
<cluster_name>:921:1146 [2] NCCL INFO NET/Socket : Using [0]ibp12s0:10.149.0.32<0> [1]ibp75s0:10.149.1.32<0> [2]ibp141s0:10.149.2.32<0> [3]ibp186s0:10.149.3.32<0>
<cluster_name>:921:1146 [2] NCCL INFO Using network Socket
<cluster_name>:926:1151 [7] NCCL INFO NET/Socket : Using [0]ibp12s0:10.149.0.32<0> [1]ibp75s0:10.149.1.32<0> [2]ibp141s0:10.149.2.32<0> [3]ibp186s0:10.149.3.32<0>
<cluster_name>:926:1151 [7] NCCL INFO Using network Socket
<cluster_name>:925:1148 [6] NCCL INFO NET/Socket : Using [0]ibp12s0:10.149.0.32<0> [1]ibp75s0:10.149.1.32<0> [2]ibp141s0:10.149.2.32<0> [3]ibp186s0:10.149.3.32<0>
<cluster_name>:925:1148 [6] NCCL INFO Using network Socket
<cluster_name>:922:1149 [3] NCCL INFO NET/Socket : Using [0]ibp12s0:10.149.0.32<0> [1]ibp75s0:10.149.1.32<0> [2]ibp141s0:10.149.2.32<0> [3]ibp186s0:10.149.3.32<0>
<cluster_name>:922:1149 [3] NCCL INFO Using network Socket
<cluster_name>:919:1150 [0] NCCL INFO NET/Socket : Using [0]ibp12s0:10.149.0.32<0> [1]ibp75s0:10.149.1.32<0> [2]ibp141s0:10.149.2.32<0> [3]ibp186s0:10.149.3.32<0>
<cluster_name>:923:1147 [4] NCCL INFO NET/Socket : Using [0]ibp12s0:10.149.0.32<0> [1]ibp75s0:10.149.1.32<0> [2]ibp141s0:10.149.2.32<0> [3]ibp186s0:10.149.3.32<0>
<cluster_name>:919:1150 [0] NCCL INFO Using network Socket
<cluster_name>:923:1147 [4] NCCL INFO Using network Socket
<cluster_name>:925:1148 [6] NCCL INFO comm 0x55555d812370 rank 14 nranks 16 cudaDev 6 nvmlDev 6 busId b7000 commId 0xf03f81dc1a49efab - Init START
<cluster_name>:926:1151 [7] NCCL INFO comm 0x55555d83f560 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId bd000 commId 0xf03f81dc1a49efab - Init START
<cluster_name>:924:1152 [5] NCCL INFO comm 0x55555d810f00 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId 90000 commId 0xf03f81dc1a49efab - Init START
<cluster_name>:923:1147 [4] NCCL INFO comm 0x55555d811a40 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId 87000 commId 0xf03f81dc1a49efab - Init START
<cluster_name>:922:1149 [3] NCCL INFO comm 0x55555d812aa0 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 4e000 commId 0xf03f81dc1a49efab - Init START
<cluster_name>:921:1146 [2] NCCL INFO comm 0x55555d811870 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 47000 commId 0xf03f81dc1a49efab - Init START
<cluster_name>:919:1150 [0] NCCL INFO comm 0x55555d813260 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 7000 commId 0xf03f81dc1a49efab - Init START
<cluster_name>:920:1145 [1] NCCL INFO comm 0x55555d812130 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId f000 commId 0xf03f81dc1a49efab - Init START
<cluster_name>:923:1147 [4] NCCL INFO Setting affinity for GPU 4 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000,00000000
<cluster_name>:923:1147 [4] NCCL INFO NVLS multicast support is not available on dev 4
<cluster_name>:926:1151 [7] NCCL INFO Setting affinity for GPU 7 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
<cluster_name>:926:1151 [7] NCCL INFO NVLS multicast support is not available on dev 7
<cluster_name>:925:1148 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
<cluster_name>:919:1150 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
<cluster_name>:919:1150 [0] NCCL INFO NVLS multicast support is not available on dev 0
<cluster_name>:921:1146 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,00000000,00000000,00000000,ffff0000
<cluster_name>:924:1152 [5] NCCL INFO Setting affinity for GPU 5 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000,00000000
<cluster_name>:921:1146 [2] NCCL INFO NVLS multicast support is not available on dev 2
<cluster_name>:924:1152 [5] NCCL INFO NVLS multicast support is not available on dev 5
<cluster_name>:925:1148 [6] NCCL INFO NVLS multicast support is not available on dev 6
<cluster_name>:922:1149 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,00000000,00000000,00000000,ffff0000
<cluster_name>:920:1145 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
<cluster_name>:920:1145 [1] NCCL INFO NVLS multicast support is not available on dev 1
<cluster_name>:922:1149 [3] NCCL INFO NVLS multicast support is not available on dev 3
<cluster_name>:919:1150 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] 9/-1/-1->8->15 [2] 9/-1/-1->8->15 [3] 9/-1/-1->8->15 [4] 9/0/-1->8->-1 [5] 9/-1/-1->8->15 [6] 9/-1/-1->8->15 [7] 9/-1/-1->8->15
<cluster_name>:919:1150 [0] NCCL INFO P2P Chunksize set to 131072
<cluster_name>:920:1145 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] -1/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] 10/-1/-1->9->8 [4] 10/-1/-1->9->8 [5] -1/-1/-1->9->8 [6] 10/-1/-1->9->8 [7] 10/-1/-1->9->8
<cluster_name>:920:1145 [1] NCCL INFO P2P Chunksize set to 131072
<cluster_name>:925:1148 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->6 [4] 15/-1/-1->14->13 [5] 15/-1/-1->14->13 [6] 15/-1/-1->14->13 [7] 15/6/-1->14->-1
<cluster_name>:925:1148 [6] NCCL INFO P2P Chunksize set to 131072
<cluster_name>:923:1147 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->4 [3] 13/-1/-1->12->11 [4] 13/-1/-1->12->11 [5] 13/-1/-1->12->11 [6] 13/4/-1->12->-1 [7] 13/-1/-1->12->11
<cluster_name>:921:1146 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->2 [2] 11/-1/-1->10->9 [3] 11/-1/-1->10->9 [4] 11/-1/-1->10->9 [5] 11/2/-1->10->-1 [6] 11/-1/-1->10->9 [7] 11/-1/-1->10->9
<cluster_name>:922:1149 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] -1/-1/-1->11->10 [3] 12/-1/-1->11->10 [4] 12/-1/-1->11->10 [5] 12/-1/-1->11->10 [6] -1/-1/-1->11->10 [7] 12/-1/-1->11->10
<cluster_name>:923:1147 [4] NCCL INFO P2P Chunksize set to 131072
<cluster_name>:921:1146 [2] NCCL INFO P2P Chunksize set to 131072
<cluster_name>:924:1152 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] -1/-1/-1->13->12 [4] 14/-1/-1->13->12 [5] 14/-1/-1->13->12 [6] 14/-1/-1->13->12 [7] -1/-1/-1->13->12
<cluster_name>:926:1151 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] 8/-1/-1->15->14 [3] 8/-1/-1->15->14 [4] -1/-1/-1->15->14 [5] 8/-1/-1->15->14 [6] 8/-1/-1->15->14 [7] 8/-1/-1->15->14
<cluster_name>:922:1149 [3] NCCL INFO P2P Chunksize set to 131072
<cluster_name>:924:1152 [5] NCCL INFO P2P Chunksize set to 131072
<cluster_name>:926:1151 [7] NCCL INFO P2P Chunksize set to 131072
<cluster_name>:919:1150 [0] NCCL INFO Channel 00/0 : 1[1] -> 8[0] [receive] via NET/Socket/0
<cluster_name>:919:1150 [0] NCCL INFO Channel 04/0 : 1[1] -> 8[0] [receive] via NET/Socket/0
<cluster_name>:919:1150 [0] NCCL INFO Channel 00/0 : 8[0] -> 15[7] via P2P/IPC/read
<cluster_name>:923:1147 [4] NCCL INFO Channel 02/0 : 5[5] -> 12[4] [receive] via NET/Socket/2
<cluster_name>:920:1145 [1] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [send] via NET/Socket/0
<cluster_name>:925:1148 [6] NCCL INFO Channel 03/0 : 7[7] -> 14[6] [receive] via NET/Socket/3
<cluster_name>:920:1145 [1] NCCL INFO Channel 04/0 : 9[1] -> 0[0] [send] via NET/Socket/0
<cluster_name>:923:1147 [4] NCCL INFO Channel 06/0 : 5[5] -> 12[4] [receive] via NET/Socket/2
<cluster_name>:921:1146 [2] NCCL INFO Channel 01/0 : 3[3] -> 10[2] [receive] via NET/Socket/1
<cluster_name>:925:1148 [6] NCCL INFO Channel 07/0 : 7[7] -> 14[6] [receive] via NET/Socket/3
<cluster_name>:922:1149 [3] NCCL INFO Channel 01/0 : 11[3] -> 2[2] [send] via NET/Socket/1
<cluster_name>:924:1152 [5] NCCL INFO Channel 02/0 : 13[5] -> 4[4] [send] via NET/Socket/2
<cluster_name>:921:1146 [2] NCCL INFO Channel 05/0 : 3[3] -> 10[2] [receive] via NET/Socket/1
<cluster_name>:922:1149 [3] NCCL INFO Channel 05/0 : 11[3] -> 2[2] [send] via NET/Socket/1
<cluster_name>:924:1152 [5] NCCL INFO Channel 06/0 : 13[5] -> 4[4] [send] via NET/Socket/2
<cluster_name>:919:1150 [0] NCCL INFO Channel 01/0 : 8[0] -> 15[7] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 02/0 : 8[0] -> 15[7] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 03/0 : 8[0] -> 15[7] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 04/0 : 8[0] -> 15[7] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 05/0 : 8[0] -> 15[7] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 06/0 : 8[0] -> 15[7] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 07/0 : 8[0] -> 15[7] via P2P/IPC/read
<cluster_name>:926:1151 [7] NCCL INFO Channel 03/0 : 15[7] -> 6[6] [send] via NET/Socket/3
<cluster_name>:926:1151 [7] NCCL INFO Channel 07/0 : 15[7] -> 6[6] [send] via NET/Socket/3
<cluster_name>:921:1146 [2] NCCL INFO Channel 00/0 : 10[2] -> 9[1] via P2P/IPC/read
<cluster_name>:923:1147 [4] NCCL INFO Channel 00/0 : 12[4] -> 11[3] via P2P/IPC/read
<cluster_name>:926:1151 [7] NCCL INFO Channel 00/0 : 15[7] -> 14[6] via P2P/IPC/read
<cluster_name>:923:1147 [4] NCCL INFO Channel 01/0 : 12[4] -> 11[3] via P2P/IPC/read
<cluster_name>:921:1146 [2] NCCL INFO Channel 01/0 : 10[2] -> 9[1] via P2P/IPC/read
<cluster_name>:926:1151 [7] NCCL INFO Channel 01/0 : 15[7] -> 14[6] via P2P/IPC/read
<cluster_name>:924:1152 [5] NCCL INFO Channel 00/0 : 13[5] -> 12[4] via P2P/IPC/read
<cluster_name>:921:1146 [2] NCCL INFO Channel 02/0 : 10[2] -> 9[1] via P2P/IPC/read
<cluster_name>:923:1147 [4] NCCL INFO Channel 02/0 : 12[4] -> 11[3] via P2P/IPC/read
<cluster_name>:926:1151 [7] NCCL INFO Channel 02/0 : 15[7] -> 14[6] via P2P/IPC/read
<cluster_name>:922:1149 [3] NCCL INFO Channel 00/0 : 11[3] -> 10[2] via P2P/IPC/read
<cluster_name>:924:1152 [5] NCCL INFO Channel 01/0 : 13[5] -> 12[4] via P2P/IPC/read
<cluster_name>:921:1146 [2] NCCL INFO Channel 03/0 : 10[2] -> 9[1] via P2P/IPC/read
<cluster_name>:923:1147 [4] NCCL INFO Channel 03/0 : 12[4] -> 11[3] via P2P/IPC/read
<cluster_name>:926:1151 [7] NCCL INFO Channel 04/0 : 15[7] -> 14[6] via P2P/IPC/read
<cluster_name>:922:1149 [3] NCCL INFO Channel 02/0 : 11[3] -> 10[2] via P2P/IPC/read
<cluster_name>:924:1152 [5] NCCL INFO Channel 03/0 : 13[5] -> 12[4] via P2P/IPC/read
<cluster_name>:921:1146 [2] NCCL INFO Channel 04/0 : 10[2] -> 9[1] via P2P/IPC/read
<cluster_name>:923:1147 [4] NCCL INFO Channel 04/0 : 12[4] -> 11[3] via P2P/IPC/read
<cluster_name>:926:1151 [7] NCCL INFO Channel 05/0 : 15[7] -> 14[6] via P2P/IPC/read
<cluster_name>:922:1149 [3] NCCL INFO Channel 03/0 : 11[3] -> 10[2] via P2P/IPC/read
<cluster_name>:924:1152 [5] NCCL INFO Channel 04/0 : 13[5] -> 12[4] via P2P/IPC/read
<cluster_name>:921:1146 [2] NCCL INFO Channel 05/0 : 10[2] -> 9[1] via P2P/IPC/read
<cluster_name>:923:1147 [4] NCCL INFO Channel 05/0 : 12[4] -> 11[3] via P2P/IPC/read
<cluster_name>:926:1151 [7] NCCL INFO Channel 06/0 : 15[7] -> 14[6] via P2P/IPC/read
<cluster_name>:922:1149 [3] NCCL INFO Channel 04/0 : 11[3] -> 10[2] via P2P/IPC/read
<cluster_name>:924:1152 [5] NCCL INFO Channel 05/0 : 13[5] -> 12[4] via P2P/IPC/read
<cluster_name>:921:1146 [2] NCCL INFO Channel 06/0 : 10[2] -> 9[1] via P2P/IPC/read
<cluster_name>:923:1147 [4] NCCL INFO Channel 06/0 : 12[4] -> 11[3] via P2P/IPC/read
<cluster_name>:922:1149 [3] NCCL INFO Channel 06/0 : 11[3] -> 10[2] via P2P/IPC/read
<cluster_name>:924:1152 [5] NCCL INFO Channel 07/0 : 13[5] -> 12[4] via P2P/IPC/read
<cluster_name>:921:1146 [2] NCCL INFO Channel 07/0 : 10[2] -> 9[1] via P2P/IPC/read
<cluster_name>:923:1147 [4] NCCL INFO Channel 07/0 : 12[4] -> 11[3] via P2P/IPC/read
<cluster_name>:922:1149 [3] NCCL INFO Channel 07/0 : 11[3] -> 10[2] via P2P/IPC/read
<cluster_name>:925:1148 [6] NCCL INFO Channel 00/0 : 14[6] -> 13[5] via P2P/IPC/read
<cluster_name>:925:1148 [6] NCCL INFO Channel 01/0 : 14[6] -> 13[5] via P2P/IPC/read
<cluster_name>:925:1148 [6] NCCL INFO Channel 02/0 : 14[6] -> 13[5] via P2P/IPC/read
<cluster_name>:925:1148 [6] NCCL INFO Channel 03/0 : 14[6] -> 13[5] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 01/0 : 9[1] -> 8[0] via P2P/IPC/read
<cluster_name>:925:1148 [6] NCCL INFO Channel 04/0 : 14[6] -> 13[5] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 02/0 : 9[1] -> 8[0] via P2P/IPC/read
<cluster_name>:925:1148 [6] NCCL INFO Channel 05/0 : 14[6] -> 13[5] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 03/0 : 9[1] -> 8[0] via P2P/IPC/read
<cluster_name>:925:1148 [6] NCCL INFO Channel 06/0 : 14[6] -> 13[5] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 05/0 : 9[1] -> 8[0] via P2P/IPC/read
<cluster_name>:925:1148 [6] NCCL INFO Channel 07/0 : 14[6] -> 13[5] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 06/0 : 9[1] -> 8[0] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 07/0 : 9[1] -> 8[0] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Connected all rings
<cluster_name>:919:1150 [0] NCCL INFO Connected all rings
<cluster_name>:919:1150 [0] NCCL INFO Channel 00/0 : 8[0] -> 9[1] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 01/0 : 8[0] -> 9[1] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 02/0 : 8[0] -> 9[1] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 03/0 : 8[0] -> 9[1] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 04/0 : 8[0] -> 9[1] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 05/0 : 8[0] -> 9[1] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 06/0 : 8[0] -> 9[1] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 07/0 : 8[0] -> 9[1] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 00/0 : 9[1] -> 10[2] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 02/0 : 9[1] -> 10[2] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 03/0 : 9[1] -> 10[2] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 04/0 : 9[1] -> 10[2] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 06/0 : 9[1] -> 10[2] via P2P/IPC/read
<cluster_name>:920:1145 [1] NCCL INFO Channel 07/0 : 9[1] -> 10[2] via P2P/IPC/read
<cluster_name>:919:1150 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [receive] via NET/Socket/0
<cluster_name>:919:1150 [0] NCCL INFO Channel 04/0 : 0[0] -> 8[0] [receive] via NET/Socket/0
<cluster_name>:919:1150 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [send] via NET/Socket/0
<cluster_name>:919:1150 [0] NCCL INFO Channel 04/0 : 8[0] -> 0[0] [send] via NET/Socket/0
<cluster_name>:922:1158 [3] NCCL INFO misc/socket.cc:567 -> 2
<cluster_name>:922:1158 [3] NCCL INFO misc/socket.cc:586 -> 2
<cluster_name>:922:1158 [3] NCCL INFO transport/net_socket.cc:336 -> 2
<cluster_name>:922:1158 [3] NCCL INFO transport/net.cc:592 -> 2
<cluster_name>:922:1158 [3] NCCL INFO proxy.cc:1306 -> 2

<cluster_name>:922:1158 [3] proxy.cc:1485 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=2, closing connection

<cluster_name>:922:1158 [3] proxy.cc:1519 NCCL WARN [Proxy Service 11] Failed to execute operation Connect from rank 11, retcode 2

<cluster_name>:922:1149 [3] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer <cluster_name>.ibnet<53735>
<cluster_name>:922:1149 [3] NCCL INFO misc/socket.cc:749 -> 6

<cluster_name>:922:1149 [3] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7ffea68dcba0
<cluster_name>:922:1149 [3] NCCL INFO transport/net.cc:288 -> 3
<cluster_name>:922:1149 [3] NCCL INFO transport.cc:148 -> 3
<cluster_name>:922:1149 [3] NCCL INFO init.cc:1079 -> 3
<cluster_name>:922:1149 [3] NCCL INFO init.cc:1358 -> 3
<cluster_name>:922:1149 [3] NCCL INFO group.cc:65 -> 3 [Async thread]
<cluster_name>:922:922 [3] NCCL INFO group.cc:406 -> 3
<cluster_name>:922:922 [3] NCCL INFO group.cc:96 -> 3
<cluster_name>:922:922 [3] NCCL INFO comm 0x55555d812aa0 rank 11 nranks 16 cudaDev 3 busId 4e000 - Abort COMPLETE
Traceback (most recent call last):
  File "/workspace/Megatron-LM/pretrain_gpt.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/workspace/Megatron-LM/megatron/training.py", line 90, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/Megatron-LM/megatron/initialize.py", line 86, in initialize_megatron
    _compile_dependencies()
  File "/workspace/Megatron-LM/megatron/initialize.py", line 150, in _compile_dependencies
    torch.distributed.barrier()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3646, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1182, internal error - please report this issue to the NCCL developers, NCCL version 2.18.3
ncclInternalError: Internal check failed.
Last error:
Socket recv failed while polling for opId=0x7ffea68dcba0
<cluster_name>:924:1159 [5] NCCL INFO misc/socket.cc:567 -> 2
<cluster_name>:924:1159 [5] NCCL INFO misc/socket.cc:586 -> 2
<cluster_name>:924:1159 [5] NCCL INFO transport/net_socket.cc:336 -> 2
<cluster_name>:924:1159 [5] NCCL INFO transport/net.cc:592 -> 2
<cluster_name>:924:1159 [5] NCCL INFO proxy.cc:1306 -> 2

<cluster_name>:924:1159 [5] proxy.cc:1485 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=2, closing connection

<cluster_name>:924:1159 [5] proxy.cc:1519 NCCL WARN [Proxy Service 13] Failed to execute operation Connect from rank 13, retcode 2

<cluster_name>:924:1152 [5] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer <cluster_name>.ibnet<39123>
<cluster_name>:924:1152 [5] NCCL INFO misc/socket.cc:749 -> 6

<cluster_name>:924:1152 [5] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7ffea68e01c0
<cluster_name>:924:1152 [5] NCCL INFO transport/net.cc:288 -> 3
<cluster_name>:924:1152 [5] NCCL INFO transport.cc:148 -> 3
<cluster_name>:924:1152 [5] NCCL INFO init.cc:1079 -> 3
<cluster_name>:924:1152 [5] NCCL INFO init.cc:1358 -> 3
<cluster_name>:924:1152 [5] NCCL INFO group.cc:65 -> 3 [Async thread]
<cluster_name>:924:924 [5] NCCL INFO group.cc:406 -> 3
<cluster_name>:924:924 [5] NCCL INFO group.cc:96 -> 3
<cluster_name>:926:1160 [7] NCCL INFO misc/socket.cc:567 -> 2
<cluster_name>:926:1160 [7] NCCL INFO misc/socket.cc:586 -> 2
<cluster_name>:926:1160 [7] NCCL INFO transport/net_socket.cc:336 -> 2
<cluster_name>:926:1160 [7] NCCL INFO transport/net.cc:592 -> 2
<cluster_name>:926:1160 [7] NCCL INFO proxy.cc:1306 -> 2

<cluster_name>:926:1160 [7] proxy.cc:1485 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=2, closing connection

<cluster_name>:926:1160 [7] proxy.cc:1519 NCCL WARN [Proxy Service 15] Failed to execute operation Connect from rank 15, retcode 2

<cluster_name>:926:1151 [7] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer <cluster_name>.ibnet<49785>
<cluster_name>:926:1151 [7] NCCL INFO misc/socket.cc:749 -> 6

<cluster_name>:926:1151 [7] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7ffea28d71a0
<cluster_name>:926:1151 [7] NCCL INFO transport/net.cc:288 -> 3
<cluster_name>:926:1151 [7] NCCL INFO transport.cc:148 -> 3
<cluster_name>:926:1151 [7] NCCL INFO init.cc:1079 -> 3
<cluster_name>:926:1151 [7] NCCL INFO init.cc:1358 -> 3
<cluster_name>:926:1151 [7] NCCL INFO group.cc:65 -> 3 [Async thread]
<cluster_name>:926:926 [7] NCCL INFO group.cc:406 -> 3
<cluster_name>:926:926 [7] NCCL INFO group.cc:96 -> 3
<cluster_name>:924:924 [5] NCCL INFO comm 0x55555d810f00 rank 13 nranks 16 cudaDev 5 busId 90000 - Abort COMPLETE
Traceback (most recent call last):
  File "/workspace/Megatron-LM/pretrain_gpt.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/workspace/Megatron-LM/megatron/training.py", line 90, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/Megatron-LM/megatron/initialize.py", line 86, in initialize_megatron
    _compile_dependencies()
  File "/workspace/Megatron-LM/megatron/initialize.py", line 150, in _compile_dependencies
    torch.distributed.barrier()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3646, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1182, internal error - please report this issue to the NCCL developers, NCCL version 2.18.3
ncclInternalError: Internal check failed.
Last error:
Socket recv failed while polling for opId=0x7ffea68e01c0
<cluster_name>:926:926 [7] NCCL INFO comm 0x55555d83f560 rank 15 nranks 16 cudaDev 7 busId bd000 - Abort COMPLETE
Traceback (most recent call last):
  File "/workspace/Megatron-LM/pretrain_gpt.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/workspace/Megatron-LM/megatron/training.py", line 90, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/Megatron-LM/megatron/initialize.py", line 86, in initialize_megatron
    _compile_dependencies()
  File "/workspace/Megatron-LM/megatron/initialize.py", line 150, in _compile_dependencies
    torch.distributed.barrier()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3646, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1182, internal error - please report this issue to the NCCL developers, NCCL version 2.18.3
ncclInternalError: Internal check failed.
Last error:
Socket recv failed while polling for opId=0x7ffea28d71a0
sjeaugey commented 1 year ago

Indeed, it looks like you don't have IB NICs inside your container. You may want to take a look at this page on how to include IB NICs inside Docker: https://docs.nvidia.com/networking/m/view-rendered-page.action?abstractPageId=15049785
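
Something along these lines is a common way to do that (just a sketch; the exact device path and required capabilities depend on the host setup):

docker run --gpus all --shm-size=1g --ipc=host --network=host \
    --device=/dev/infiniband \
    --cap-add=IPC_LOCK \
    --ulimit memlock=-1 \
    --env NCCL_DEBUG=INFO \
    -it --rm -v /home/${USER}:/workspace \
    nvcr.io/nvidia/pytorch:23.06-py3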

That said, the crash would probably be the same even with IB; your network communication doesn't seem to work. You're using the ibp12s0 interface to communicate between processes; could it be that a firewall is blocking communication on that interface? If you're running only on a single node (I'm not sure, as the log seems to show only one node), you can try setting NCCL_SOCKET_IFNAME=lo to see whether that's the issue (of course, that won't work between nodes).
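
For example, to try that check (or to pin NCCL to a specific interface in the multi-node case), you could export the variable before launching; a rough sketch:

# single-node sanity check only; loopback cannot be used between nodes
export NCCL_SOCKET_IFNAME=lo

# for multi-node, pin NCCL to an interface reachable from the other node, e.g.:
# export NCCL_SOCKET_IFNAME=ibp12s0

export NCCL_DEBUG=INFO
bash examples/pretrain_gpt_distributed_with_mp.sh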

premmotgi commented 1 year ago

I'm having the same issue. I'm using srun to run the container with pyxis. Is there any way the network interface can be specified with the srun command?
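
(Roughly what I'm after, assuming pyxis's --container-image/--container-mounts options; the interface name is just a placeholder and would need to match the cluster:)

srun --container-image=nvcr.io#nvidia/pytorch:23.06-py3 \
     --container-mounts=/home/$USER:/workspace \
     --export=ALL,NCCL_SOCKET_IFNAME=ibp12s0,NCCL_DEBUG=INFO \
     bash examples/pretrain_gpt_distributed_with_mp.sh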

Kk1984up commented 10 months ago

I have the same issue. How did you fix it?

thincal commented 6 months ago

@premmotgi @Kk1984up Which OS version are you using?