NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Internal error when submitting a job to a Ray cluster #1297

Open troelsfr opened 6 months ago

troelsfr commented 6 months ago

How to reproduce

You need two machines. On both machines, install Ray, vLLM, and the related dependencies:

pip install -U ray vllm hf_transfer torch

Then on the first machine, start a head node:

ray start --head 

and then connect the second machine to it:

ray start --address=[address]
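Optionally, you can confirm that both nodes have joined the cluster by running the standard Ray CLI status command on the head node:

ray status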

Bug

After setting up the Ray cluster and running:

python3 -m vllm.entrypoints.api_server --model facebook/opt-13b --tensor-parallel-size 2

I get the following error (which suggests filing a bug report here):

(RayWorkerWrapper pid=246832, ip=141.105.68.142) INFO 05-24 09:59:28 pynccl_utils.py:43] vLLM is using nccl==2.18.1
*** SIGSEGV received at time=1716533968 on cpu 8 ***
ERROR 05-24 09:59:28 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 05-24 09:59:28 worker_base.py:145] Traceback (most recent call last):
ERROR 05-24 09:59:28 worker_base.py:145]   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
ERROR 05-24 09:59:28 worker_base.py:145]     return executor(*args, **kwargs)
ERROR 05-24 09:59:28 worker_base.py:145]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-24 09:59:28 worker_base.py:145]   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/worker/worker.py", line 111, in init_device
ERROR 05-24 09:59:28 worker_base.py:145]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 05-24 09:59:28 worker_base.py:145]   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/worker/worker.py", line 305, in init_worker_distributed_environment
ERROR 05-24 09:59:28 worker_base.py:145]     pynccl_utils.init_process_group(
ERROR 05-24 09:59:28 worker_base.py:145]   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
ERROR 05-24 09:59:28 worker_base.py:145]     comm = NCCLCommunicator(group=group)
ERROR 05-24 09:59:28 worker_base.py:145]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-24 09:59:28 worker_base.py:145]   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 264, in __init__
ERROR 05-24 09:59:28 worker_base.py:145]     NCCL_CHECK(
ERROR 05-24 09:59:28 worker_base.py:145]   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 73, in NCCL_CHECK
ERROR 05-24 09:59:28 worker_base.py:145]     raise RuntimeError(f"NCCL error: {error_str}")
ERROR 05-24 09:59:28 worker_base.py:145] RuntimeError: NCCL error: internal error - please report this issue to the NCCL developers
PC: @     0x7f329b07e905  (unknown)  ncclProxyService()
    @     0x7f393e63c460       3496  (unknown)
    @       0x11ffffffff  (unknown)  (unknown)
[2024-05-24 09:59:28,596 E 2944052 2946421] logging.cc:365: *** SIGSEGV received at time=1716533968 on cpu 8 ***
[2024-05-24 09:59:28,596 E 2944052 2946421] logging.cc:365: PC: @     0x7f329b07e905  (unknown)  ncclProxyService()
[2024-05-24 09:59:28,597 E 2944052 2946421] logging.cc:365:     @     0x7f393e63c460       3496  (unknown)
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/entrypoints/api_server.py", line 107, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:                           ^^^^^^^^^^^^^^^
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 300, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 43, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 164, in _init_workers_ray
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 234, in _run_workers
[rank0]:     driver_worker_output = self.driver_worker.execute_method(
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 146, in execute_method
[rank0]:     raise e
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/worker/worker.py", line 111, in init_device
[rank0]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/worker/worker.py", line 305, in init_worker_distributed_environment
[rank0]:     pynccl_utils.init_process_group(
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
[rank0]:     comm = NCCLCommunicator(group=group)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 264, in __init__
[rank0]:     NCCL_CHECK(
[rank0]:   File "/home/tfr/ray_env/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 73, in NCCL_CHECK
[rank0]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: internal error - please report this issue to the NCCL developers
[2024-05-24 09:59:28,598 E 2944052 2946421] logging.cc:365:     @       0x11ffffffff  (unknown)  (unknown)
Fatal Python error: Segmentation fault

Version

Here are some of the dependencies that might be relevant:

nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.550.52
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
ray==2.23.0
torch==2.3.0
transformers==4.41.1
vllm==0.4.2
vllm-nccl-cu12==2.18.1.0.4.0
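Note that two different NCCL wheels appear above (nvidia-nccl-cu12==2.20.5 and vllm-nccl-cu12==2.18.1.0.4.0), while the log reports nccl==2.18.1. If it helps narrow down the mismatch, the NCCL version that PyTorch itself reports can be printed with:

python3 -c "import torch; print(torch.cuda.nccl.version())"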
AddyLaddy commented 6 months ago

You would need to run with NCCL_DEBUG=WARN or NCCL_DEBUG=INFO for us to be able to offer any advice on this issue. Normally, though, these errors are caused by system or network configuration issues.
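For example, one way to enable this (assuming the variables are exported on each node before `ray start`, so that the Ray worker processes inherit them) is:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET   # optional: limit the output to the init and network subsystems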

dwq370 commented 1 month ago

I am hitting the same error: a segmentation fault during NCCL_CHECK. Any solutions?

INFO 10-27 07:48:16 pynccl.py:63] vLLM is using nccl==2.20.5
master:608:608 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens121f1
master:608:608 [0] NCCL INFO NCCL_SOCKET_IFNAME set to ens121f1
master:608:608 [0] NCCL INFO Bootstrap : Using ens121f1:192.168.10.13<0>
master:608:608 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
master:608:608 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
master:608:608 [0] NCCL INFO init.cc:1732 Cuda Host Alloc Size 4 pointer 0x73a8ab800000
(RayWorkerWrapper pid=284, ip=192.168.31.49) INFO 10-27 07:48:16 utils.py:977] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=284, ip=192.168.31.49) INFO 10-27 07:48:16 pynccl.py:63] vLLM is using nccl==2.20.5
master:608:608 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens121f1
master:608:608 [0] NCCL INFO NCCL_IB_HCA set to rxe0

master:608:608 [0] transport/net_ib.cc:115 NCCL WARN Could not find real path of rxe0 (/sys/class/infiniband/rxe0/device)
master:608:608 [0] NCCL INFO NET/IB : Using [0]rxe0:1/RoCE [RO]; OOB ens121f1:192.168.10.13<0>
master:608:608 [0] NCCL INFO Using non-device net plugin version 0
master:608:608 [0] NCCL INFO Using network IB
master:608:608 [0] NCCL INFO comm 0xb9dfd00 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 36000 commId 0x44748f6c8a51311c - Init START
master:608:608 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
master:608:608 [0] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'rxe0'
master:608:608 [0] NCCL INFO === System : maxBw 1.2 totalBw 24.0 ===
master:608:608 [0] NCCL INFO CPU/0 (1/1/2)
master:608:608 [0] NCCL INFO + PCI[5000.0] - NIC/0
master:608:608 [0] NCCL INFO                 + NET[1.2] - NET/0 (72585afeff9196b6/1/1.250000)
master:608:608 [0] NCCL INFO + PCI[24.0] - PCI/31000 (11f8400011f8beef)
master:608:608 [0] NCCL INFO               + PCI[24.0] - GPU/36000 (0)
master:608:608 [0] NCCL INFO ==========================================
master:608:608 [0] NCCL INFO GPU/36000 :GPU/36000 (0/5000.000000/LOC) CPU/0 (2/24.000000/PHB) NET/0 (4/1.250000/PHB)
master:608:608 [0] NCCL INFO NET/0 :GPU/36000 (4/1.250000/PHB) CPU/0 (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
master:608:608 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ff000000,0fffffff
master:608:608 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 1.200000/1.200000, type LOC/PHB, sameChannels 1
master:608:608 [0] NCCL INFO  0 : NET/0 GPU/0 NET/0
master:608:608 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 2.400000/1.200000, type LOC/PHB, sameChannels 1
master:608:608 [0] NCCL INFO  0 : NET/0 GPU/0 NET/0
master:608:608 [0] NCCL INFO comm 0xb9dfd00 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
master:608:608 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
master:608:608 [0] NCCL INFO Tree 1 : 1 -> 0 -> -1/-1/-1
master:608:608 [0] NCCL INFO Channel 00/02 :    0   1
master:608:608 [0] NCCL INFO Channel 01/02 :    0   1
master:608:608 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1
master:608:608 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1
master:608:608 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
master:608:608 [0] NCCL INFO P2P Chunksize set to 131072
master:608:608 [0] NCCL INFO UDS: Creating service thread comm 0xb9dfd00 rank 0
master:608:608 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
master:608:608 [0] NCCL INFO channel.cc:40 Cuda Alloc Size 1152 pointer 0x73a85d400000
master:608:608 [0] NCCL INFO channel.cc:43 Cuda Alloc Size 32 pointer 0x73a85d400600
master:608:608 [0] NCCL INFO channel.cc:54 Cuda Alloc Size 8 pointer 0x73a85d400800
master:608:608 [0] NCCL INFO channel.cc:40 Cuda Alloc Size 1152 pointer 0x73a85d400a00
master:608:608 [0] NCCL INFO channel.cc:43 Cuda Alloc Size 32 pointer 0x73a85d401000
master:608:608 [0] NCCL INFO channel.cc:54 Cuda Alloc Size 8 pointer 0x73a85d401200
master:608:7169 [0] NCCL INFO Mem Realloc old size 0, new size 8 pointer 0x73a868000be0
master:608:7169 [0] NCCL INFO Allocated 5767524 bytes of shared memory in /dev/shm/nccl-2iqVjb
master:608:7169 [0] NCCL INFO New proxy recv connection 0 from local rank 0, transport 2
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=1 op.reqBuff=0x73a868000ba0 op.respSize=16 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Init res=0
master:608:608 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x73a868004f40
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=3 op.reqBuff=0x73a868008f20 op.respSize=128 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Setup res=0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IB/0
master:608:7169 [0] NCCL INFO New proxy recv connection 1 from local rank 0, transport 2
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=1 op.reqBuff=0x73a86800e370 op.respSize=16 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Init res=0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x73a868004fb8
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=3 op.reqBuff=0x73a86800e3b0 op.respSize=128 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Setup res=0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IB/0
master:608:7169 [0] NCCL INFO New proxy send connection 2 from local rank 0, transport 2
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=1 op.reqBuff=0x73a868013800 op.respSize=16 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Init res=0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x73a868005030
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=3 op.reqBuff=0x73a868013840 op.respSize=0 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Setup res=0
master:608:608 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/0
master:608:7169 [0] NCCL INFO New proxy send connection 3 from local rank 0, transport 2
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=1 op.reqBuff=0x73a868018b60 op.respSize=16 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:7169 [0] NCCL INFO Received and initiated operation=Init res=0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Connected to proxy localRank 0 -> connection 0x73a8680050a8
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf0891c0 op.type=3 op.reqBuff=0x73a868018ba0 op.respSize=0 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf0891c0
master:608:608 [0] NCCL INFO resp.opId=0xf0891c0 matches expected opId=0xf0891c0
master:608:608 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/0
master:608:7169 [0] NCCL INFO Received and initiated operation=Setup res=0
master:608:608 [0] NCCL INFO sendConnect ncclProxyCallAsync opId=0xf1cb6e8
master:608:608 [0] NCCL INFO recvConnect ncclProxyCallAsync opId=0xf1cb878 &recv->proxyConn=0xf1cb880 connectInfo=0xf1df690
master:608:608 [0] NCCL INFO sendConnect ncclProxyCallAsync opId=0xf1dd668
master:608:608 [0] NCCL INFO recvConnect ncclProxyCallAsync opId=0xf1dd7f8 &recv->proxyConn=0xf1dd800 connectInfo=0xf1df710
master:608:7169 [0] NCCL INFO transport/net_ib.cc:773 Ib Alloc Size 181536 pointer 0x73a868024000
master:608:7169 [0] NCCL INFO Call to ibv_query_ece skipped, internal_name doesn't exist
master:608:7169 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
master:608:7169 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 0 Port 1 qpn 30 mtu 3 query_ece={supported=0, vendor_id=0x0, options=0x0, comp_mask=0x0} GID 3 (0/0) fifoRkey=0xa11 fifoLkey=0xa11
master:608:7169 [0] NCCL INFO transport/net_ib.cc:867 Ib Alloc Size 3336 pointer 0x73a868052000
master:608:7169 [0] NCCL INFO Received and initiated operation=Connect res=0
master:608:7169 [0] NCCL INFO transport/net_ib.cc:978 Ib Alloc Size 164416 pointer 0x73a868059000
master:608:7169 [0] NCCL INFO transport/net_ib.cc:991 Ib Alloc Size 3336 pointer 0x73a868083000
master:608:7169 [0] NCCL INFO NCCL_IB_TC set by environment to 106.
master:608:7169 [0] NCCL INFO NCCL_IB_SL set by environment to 3.

master:608:7169 [0] misc/ibvwrap.cc:206 NCCL WARN Call to ibv_modify_qp failed with error No data available errno 61
master:608:7169 [0] NCCL INFO transport/net_ib.cc:725 -> 2
master:608:7169 [0] NCCL INFO transport/net_ib.cc:1067 -> 2
master:608:7169 [0] NCCL INFO transport/net.cc:833 -> 2
master:608:7169 [0] NCCL INFO proxyProgressAsync opId=0xf1cb878 op.type=4 op.reqBuff=0x73a868018b80 op.respSize=21040 done
master:608:608 [0] NCCL INFO ncclPollProxyResponse Received new opId=0xf1cb878
master:608:7169 [0] NCCL INFO Received and initiated operation=Connect res=0
master:608:608 [0] NCCL INFO Queuing opId=0xf1cb878 respBuff=0xf20a890 respSize=21040
master:608:608 [0] NCCL INFO ncclPollProxyResponse Dequeued cached opId=0xf1cb878
master:608:608 [0] NCCL INFO transport/net.cc:402 -> 2
master:608:608 [0] NCCL INFO transport.cc:183 -> 2
master:608:608 [0] NCCL INFO init.cc:1222 -> 2
master:608:608 [0] NCCL INFO init.cc:1501 -> 2
master:608:608 [0] NCCL INFO init.cc:1746 -> 2
master:608:7169 [0] NCCL INFO transport/net_ib.cc:773 Ib Alloc Size 181536 pointer 0x73a868085000
master:608:7169 [0] NCCL INFO Call to ibv_query_ece skipped, internal_name doesn't exist
master:608:7169 [0] NCCL INFO NET/IB: NCCL Dev 0 IbDev 0 Port 1 qpn 32 mtu 3 query_ece={supported=0, vendor_id=0x0, options=0x0, comp_mask=0x0} GID 3 (0/0) fifoRkey=0xbdb fifoLkey=0xbdb
master:608:7169 [0] NCCL INFO transport/net_ib.cc:867 Ib Alloc Size 3336 pointer 0x73a8680b3000
*** SIGSEGV received at time=1730040496 on cpu 57 ***
master:608:608 [0] NCCL INFO init.cc:1784 -> 2
PC: @     0x73ad5a019cfa  (unknown)  proxyProgressAsync()
    @     0x73ae05026090  (unknown)  (unknown)
[2024-10-27 07:48:16,289 E 608 7169] logging.cc:440: *** SIGSEGV received at time=1730040496 on cpu 57 ***
[2024-10-27 07:48:16,289 E 608 7169] logging.cc:440: PC: @     0x73ad5a019cfa  (unknown)  proxyProgressAsync()
[2024-10-27 07:48:16,289 E 608 7169] logging.cc:440:     @     0x73ae05026090  (unknown)  (unknown)
Fatal Python error: Segmentation fault

ERROR 10-27 07:48:16 worker_base.py:464] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 10-27 07:48:16 worker_base.py:464] Traceback (most recent call last):
ERROR 10-27 07:48:16 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 10-27 07:48:16 worker_base.py:464]     return executor(*args, **kwargs)
ERROR 10-27 07:48:16 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 175, in init_device
ERROR 10-27 07:48:16 worker_base.py:464]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 10-27 07:48:16 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 450, in init_worker_distributed_environment
ERROR 10-27 07:48:16 worker_base.py:464]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
ERROR 10-27 07:48:16 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
ERROR 10-27 07:48:16 worker_base.py:464]     initialize_model_parallel(tensor_model_parallel_size,
ERROR 10-27 07:48:16 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 947, in initialize_model_parallel
ERROR 10-27 07:48:16 worker_base.py:464]     _PP = init_model_parallel_group(group_ranks,
ERROR 10-27 07:48:16 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
ERROR 10-27 07:48:16 worker_base.py:464]     return GroupCoordinator(
ERROR 10-27 07:48:16 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 154, in __init__
ERROR 10-27 07:48:16 worker_base.py:464]     self.pynccl_comm = PyNcclCommunicator(
ERROR 10-27 07:48:16 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
ERROR 10-27 07:48:16 worker_base.py:464]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
ERROR 10-27 07:48:16 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
ERROR 10-27 07:48:16 worker_base.py:464]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
ERROR 10-27 07:48:16 worker_base.py:464]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
ERROR 10-27 07:48:16 worker_base.py:464]     raise RuntimeError(f"NCCL error: {error_str}")
ERROR 10-27 07:48:16 worker_base.py:464] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
Process SpawnProcess-1:

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, yaml._yaml, msgspec._core, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, PIL._imaging, PIL._imagingft, regex._regex, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, zmq.backend.cython._zmq, pyarrow.lib, pyarrow._json (total: 45)
ERROR 10-27 07:48:19 api_server.py:186] RPCServer process died before responding to readiness probe
kiskra-nvidia commented 1 month ago

@dwq370 This could be a completely different problem; I recommend that you open a separate issue for it. Still, here's some quick feedback:

The segmentation fault appears to be secondary to this error:

master:608:7169 [0] misc/ibvwrap.cc:206 NCCL WARN Call to ibv_modify_qp failed with error No data available errno 61
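This warning usually means that the RoCE queue pair on rxe0 could not be transitioned to a working state. Assuming rxe0 is a (soft-)RoCE device, it may be worth confirming that it exists and that its port is ACTIVE on every node, for example with the standard rdma-core / iproute2 tools:

ibv_devinfo -d rxe0   # port state should be PORT_ACTIVE
rdma link show        # lists the RDMA links and their state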

How many processes does your NCCL job consist of? The debug output you included appears to be all from a single process (608 on node master) -- is that the only one? What was the job configuration (number of nodes/processes/GPUs)? I'm asking because the above error could well be the result of an error on another process/node, but we don't see the output from any other NCCL processes...