Open troelsfr opened 6 months ago
You would need to run with NCCL_DEBUG=WARN or NCCL_DEBUG=INFO for us to be able to offer any advice on this issue.
Normally, though, these errors are caused by system or network configuration issues.
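For anyone unsure how to turn the debug output on, here is a minimal sketch that builds an environment with NCCL_DEBUG set and can be passed to whatever launches the job. The helper name `nccl_debug_env` and the log path are made up for illustration. NCCL_DEBUG and NCCL_DEBUG_FILE are standard NCCL environment variables, but how they reach every worker depends on your launcher (with Ray, for example, they may need to be set on each node before the workers start), so treat this as a starting point rather than a recipe.

```python
import os

# Debug levels accepted by NCCL_DEBUG.
VALID_LEVELS = {"VERSION", "WARN", "INFO", "TRACE"}


def nccl_debug_env(level="INFO", log_file=None, base_env=None):
    """Return a copy of the environment with NCCL debug logging enabled.

    `nccl_debug_env` is a hypothetical helper, not part of NCCL or vLLM.
    """
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported NCCL_DEBUG level: {level}")
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    if log_file is not None:
        # NCCL expands %h to the hostname and %p to the process ID,
        # which keeps the output of different processes in separate files.
        env["NCCL_DEBUG_FILE"] = log_file
    return env


# Example (the command is a placeholder for your own launch command):
#   import subprocess
#   subprocess.run(["python", "my_job.py"],
#                  env=nccl_debug_env("INFO", "/tmp/nccl.%h.%p.log"))
```

Writing to a per-process file is optional, but it makes it easier to attach the output of every rank to the issue instead of only the one that happened to print to the console.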
I hit the same error: a segmentation fault in NCCL_CHECK. Any solutions?
INFO 10-27 07:48:16 pynccl.py:63] vLLM is using nccl==2.20.5
master:608:608 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens121f1
master:608:608 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ens121f1
master:608:608 [0] NCCL INFO Bootstrap : Using ens121f1:192.168.10.13<0>
master:608:608 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
master:608:608 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
master:608:608 [0] NCCL INFO init.cc:1732 Cuda Host Alloc Size 4 pointer 0x73a8ab800000
(RayWorkerWrapper pid=284, ip=192.168.31.49) INFO 10-27 07:48:16 utils.py:977] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=284, ip=192.168.31.49) INFO 10-27 07:48:16 pynccl.py:63] vLLM is using nccl==2.20.5
master:608:608 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens121f1
master:608:608 [0] NCCL INFO NCCL_IB_HCA set to rxe0
master:608:608 [0] transport/net_ib.cc:115 NCCL WARN Could not find real path of rxe0 (/sys/class/infiniband/rxe0/device)
master:608:608 [0] NCCL INFO NET/IB : Using [0]rxe0:1/RoCE [RO]; OOB ens121f1:192.168.10.13<0>
master:608:608 [0] NCCL INFO Using non-device net plugin version 0
master:608:608 [0] NCCL INFO Using network IB
master:608:608 [0] NCCL INFO comm 0xb9dfd00 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 36000 commId 0x44748f6c8a51311c - Init START
master:608:608 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
master:608:608 [0] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'rxe0'
master:608:608 [0] NCCL INFO === System : maxBw 1.2 totalBw 24.0 ===
master:608:608 [0] NCCL INFO CPU/0 (1/1/2)
master:608:608 [0] NCCL INFO + PCI[5000.0] - NIC/0
master:608:608 [0] NCCL INFO + NET[1.2] - NET/0 (72585afeff9196b6/1/1.250000)
master:608:608 [0] NCCL INFO + PCI[24.0] - PCI/31000 (11f8400011f8beef)
master:608:608 [0] NCCL INFO + PCI[24.0] - GPU/36000 (0)
master:608:608 [0] NCCL INFO ==========================================
master:608:608 [0] NCCL INFO GPU/36000 :GPU/36000 (0/5000.000000/LOC) CPU/0 (2/24.000000/PHB) NET/0 (4/1.250000/PHB)
master:608:608 [0] NCCL INFO NET/0 :GPU/36000 (4/1.250000/PHB) CPU/0 (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
master:608:608 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ff000000,0fffffff
master:608:608 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 1.200000/1.200000, type LOC/PHB, sameChannels 1
master:608:608 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
master:608:608 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 2.400000/1.200000, type LOC/PHB, sameChannels 1
master:608:608 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
master:608:608 [0] NCCL INFO comm 0xb9dfd00 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
master:608:608 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
master:608:608 [0] NCCL INFO Tree 1 : 1 -> 0 -> -1/-1/-1
master:608:608 [0] NCCL INFO Channel 00/02 : 0 1
master:608:608 [0] NCCL INFO Channel 01/02 : 0 1
master:608:608 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1
master:608:608 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1
master:608:608 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
master:608:608 [0] NCCL INFO P2P Chunksize set to 131072
master:608:608 [0] NCCL INFO UDS: Creating service thread comm 0xb9dfd00 rank 0
master:608:608 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
master:608:608 [0] NCCL INFO channel.cc:40 Cuda Alloc Size 1152 pointer 0x73a85d400000
master:608:608 [0] NCCL INFO channel.cc:43 Cuda Alloc Size 32 pointer 0x73a85d400600
master:608:608 [0] NCCL INFO channel.cc:54 Cuda Alloc Size 8 pointer 0x73a85d400800
master:608:608 [0] NCCL INFO channel.cc:40 Cuda Alloc Size 1152 pointer 0x73a85d400a00
master:608:608 [0] NCCL INFO channel.cc:43 Cuda Alloc Size 32 pointer 0x73a85d401000
master:608:608 [0] NCCL INFO channel.cc:54 Cuda Alloc Size 8 pointer 0x73a85d401200
master:608:7169 [0] NCCL INFO Mem Realloc old size 0, new size 8 pointer 0x73a868000be0
master:608:7169 [0] NCCL INFO Allocated 5767524 bytes of shared memory in /dev/shm/nccl-2iqVjb
master:608:7169 [0] NCCL INFO New proxy recv connection 0 from local rank 0, transport 2
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=1 op.reqBuff=0x73a868000ba0 op.respSize=16 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Init res=0
master:608:608 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x73a868004f40
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=3 op.reqBuff=0x73a868008f20 op.respSize=128 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Setup res=0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IB/0
master:608:7169 [0] NCCL INFO New proxy recv connection 1 from local rank 0, transport 2
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=1 op.reqBuff=0x73a86800e370 op.respSize=16 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Init res=0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x73a868004fb8
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=3 op.reqBuff=0x73a86800e3b0 op.respSize=128 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Setup res=0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IB/0
master:608:7169 [0] NCCL INFO New proxy send connection 2 from local rank 0, transport 2
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=1 op.reqBuff=0x73a868013800 op.respSize=16 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Init res=0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x73a868005030
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=3 op.reqBuff=0x73a868013840 op.respSize=0 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Setup res=0
master:608:608 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/0
master:608:7169 [0] NCCL INFO New proxy send connection 3 from local rank 0, transport 2
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=1 op.reqBuff=0x73a868018b60 op.respSize=16 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Init res=0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x73a8680050a8
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=3 op.reqBuff=0x73a868018ba0 op.respSize=0 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/0
master:608:7169 [0] NCCL INFO Received and initiated operation=Setup res=0
master:608:608 [0] NCCL INFO sendConnect ncclProxyCallAsync opId=0xf1cb6e8
master:608:608 [0] NCCL INFO recvConnect ncclProxyCallAsync opId=0xf1cb878 &recv->proxyConn=0xf1cb880 connectInfo=0xf1df690
master:608:608 [0] NCCL INFO sendConnect ncclProxyCallAsync opId=0xf1dd668
master:608:608 [0] NCCL INFO recvConnect ncclProxyCallAsync opId=0xf1dd7f8 &recv->proxyConn=0xf1dd800 connectInfo=0xf1df710
master:608:7169 [0] NCCL INFO transport/net_ib.cc:773 Ib Alloc Size 181536 pointer 0x73a868024000
master:608:7169 [0] NCCL INFO Call to ibv_query_ece skipped, internal_name doesn't exist
master:608:7169 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
master:608:7169 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 0 Port 1 qpn 30 mtu 3 query_ece={supported=0, vendor_id=0x0, options=0x0, comp_mask=0x0} GID 3 (0/0) fifoRkey=0xa11 fifoLkey=0xa11
master:608:7169 [0] NCCL INFO transport/net_ib.cc:867 Ib Alloc Size 3336 pointer 0x73a868052000
master:608:7169 [0] NCCL INFO Received and initiated operation=Connect res=0
master:608:7169 [0] NCCL INFO transport/net_ib.cc:978 Ib Alloc Size 164416 pointer 0x73a868059000
master:608:7169 [0] NCCL INFO transport/net_ib.cc:991 Ib Alloc Size 3336 pointer 0x73a868083000
master:608:7169 [0] NCCL INFO NCCL_IB_TC set by environment to 106.
master:608:7169 [0] NCCL INFO NCCL_IB_SL set by environment to 3.
master:608:7169 [0] misc/ibvwrap.cc:206 NCCL WARN Call to ibv_modify_qp failed with error No data available errno 61
master:608:7169 [0] NCCL INFO transport/net_ib.cc:725 -> 2
master:608:7169 [0] NCCL INFO transport/net_ib.cc:1067 -> 2
master:608:7169 [0] NCCL INFO transport/net.cc:833 -> 2
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf1cb878 op.type=4 op.reqBuff=0x73a868018b80 op.respSize=21040 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf1cb878
master:608:7169 [0] NCCL INFO Received and initiated operation=Connect res=0
master:608:608 [0] NCCL INFO Queuing opId=0xf1cb878 respBuff=0xf20a890 respSize=21040
master:608:608 [0] NCCL INFO ncclPollProxyResponse Dequeued cached opId=0xf1cb878
master:608:608 [0] NCCL INFO transport/net.cc:402 -> 2
master:608:608 [0] NCCL INFO transport.cc:183 -> 2
master:608:608 [0] NCCL INFO init.cc:1222 -> 2
master:608:608 [0] NCCL INFO init.cc:1501 -> 2
master:608:608 [0] NCCL INFO init.cc:1746 -> 2
master:608:7169 [0] NCCL INFO transport/net_ib.cc:773 Ib Alloc Size 181536 pointer 0x73a868085000
master:608:7169 [0] NCCL INFO Call to ibv_query_ece skipped, internal_name doesn't exist
master:608:7169 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 0 Port 1 qpn 32 mtu 3 query_ece={supported=0, vendor_id=0x0, options=0x0, comp_mask=0x0} GID 3 (0/0) fifoRkey=0xbdb fifoLkey=0xbdb
master:608:7169 [0] NCCL INFO transport/net_ib.cc:867 Ib Alloc Size 3336 pointer 0x73a8680b3000
*** SIGSEGV received at time=1730040496 on cpu 57 ***
master:608:608 [0] NCCL INFO init.cc:1784 -> 2
PC: @ 0x73ad5a019cfa (unknown) proxyProgressAsync()
@ 0x73ae05026090 (unknown) (unknown)
[2024-10-27 07:48:16,289 E 608 7169] logging.cc:440: *** SIGSEGV received at time=1730040496 on cpu 57 ***
[2024-10-27 07:48:16,289 E 608 7169] logging.cc:440: PC: @ 0x73ad5a019cfa (unknown) proxyProgressAsync()
[2024-10-27 07:48:16,289 E 608 7169] logging.cc:440: @ 0x73ae05026090 (unknown) (unknown)
Fatal Python error: Segmentation fault
ERROR 10-27 07:48:16 worker_base.py:464] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 10-27 07:48:16 worker_base.py:464] Traceback (most recent call last):
ERROR 10-27 07:48:16 worker_base.py:464] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 10-27 07:48:16 worker_base.py:464] return executor(*args, **kwargs)
ERROR 10-27 07:48:16 worker_base.py:464] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 175, in init_device
ERROR 10-27 07:48:16 worker_base.py:464] init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 10-27 07:48:16 worker_base.py:464] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 450, in init_worker_distributed_environment
ERROR 10-27 07:48:16 worker_base.py:464] ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
ERROR 10-27 07:48:16 worker_base.py:464] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
ERROR 10-27 07:48:16 worker_base.py:464] initialize_model_parallel(tensor_model_parallel_size,
ERROR 10-27 07:48:16 worker_base.py:464] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 947, in initialize_model_parallel
ERROR 10-27 07:48:16 worker_base.py:464] _PP = init_model_parallel_group(group_ranks,
ERROR 10-27 07:48:16 worker_base.py:464] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
ERROR 10-27 07:48:16 worker_base.py:464] return GroupCoordinator(
ERROR 10-27 07:48:16 worker_base.py:464] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 154, in __init__
ERROR 10-27 07:48:16 worker_base.py:464] self.pynccl_comm = PyNcclCommunicator(
ERROR 10-27 07:48:16 worker_base.py:464] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
ERROR 10-27 07:48:16 worker_base.py:464] self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
ERROR 10-27 07:48:16 worker_base.py:464] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
ERROR 10-27 07:48:16 worker_base.py:464] self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
ERROR 10-27 07:48:16 worker_base.py:464] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
ERROR 10-27 07:48:16 worker_base.py:464] raise RuntimeError(f"NCCL error: {error_str}")
ERROR 10-27 07:48:16 worker_base.py:464] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
Process SpawnProcess-1:
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, yaml._yaml, msgspec._core, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, PIL._imaging, PIL._imagingft, regex._regex, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, zmq.backend.cython._zmq, pyarrow.lib, pyarrow._json (total: 45)
ERROR 10-27 07:48:19 api_server.py:186] RPCServer process died before responding to readiness probe
@dwq370 This could be a completely different problem; I recommend that you open a separate issue for it. Still, here's some quick feedback:
The segmentation fault appears to be secondary to this error:
master:608:7169 [0] misc/ibvwrap.cc:206 NCCL WARN Call to ibv_modify_qp failed with error No data available errno 61
How many processes does your NCCL job consist of? The debug output you included appears to be all from a single process (608 on node master) -- is that the only one? What was the job configuration (number of nodes/processes/GPUs)? I'm asking because the above error could well be the result of an error on another process/node, but we don't see the output from any other NCCL processes...
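A quick way to check which processes a captured log actually covers is to group the NCCL lines by host and PID and pull out the WARN lines. The sketch below does that. The names `NCCL_LINE` and `summarize_nccl_log` are hypothetical, and the regular expression is inferred only from the lines pasted in this thread (`host:pid:tid [dev] NCCL INFO ...`, with a `file:line` field before `NCCL WARN`), so it may need adjusting for other NCCL versions or log prefixes.

```python
import re
from collections import defaultdict

# Pattern inferred from the log lines in this thread, for example:
#   master:608:608 [0] NCCL INFO Using network IB
#   master:608:7169 [0] misc/ibvwrap.cc:206 NCCL WARN Call to ibv_modify_qp failed ...
NCCL_LINE = re.compile(
    r"(?P<host>[^\s:]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] "
    r"(?:(?P<src>\S+) )?NCCL (?P<level>INFO|WARN) (?P<msg>.*)$"
)


def summarize_nccl_log(lines):
    """Count NCCL lines per (host, pid) and collect the WARN lines."""
    counts = defaultdict(int)
    warnings = []
    for line in lines:
        # search() rather than match(), so lines carrying an extra prefix
        # (such as a Ray worker tag) are still recognised.
        m = NCCL_LINE.search(line.strip())
        if m is None:
            continue  # not an NCCL line (Python traceback, vLLM log, ...)
        key = (m["host"], int(m["pid"]))
        counts[key] += 1
        if m["level"] == "WARN":
            warnings.append((key, m["src"], m["msg"]))
    return dict(counts), warnings
```

If the summary contains a single `(host, pid)` pair for a job that is supposed to span two nodes, the output of the other rank is missing, and that is the part most likely to explain the failure.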
How to reproduce
You need two machines. On both machines, install Ray:
Then on the first machine, start a head node:
and then connect the second machine to it:
Bug
After setting up the Ray cluster and running:
I get the following error (which suggests filing a bug report here):
Version
Here are some of the dependencies that might be relevant: