NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

[BUG]: NCCL_SHM_DISABLE flag is not working #1466

Open priyanshu891 opened 1 month ago

priyanshu891 commented 1 month ago

Hi, I have observed that although I have passed NCCL_SHM_DISABLE: 1, NCCL still tries to access /dev/shm and fails with an error. Is this behaviour expected, or is it a bug? I have attached the log below for reference. Let me know if I am missing something.
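For context, the flag is set in the process environment before the engine is created. A minimal sketch of what that looks like (the model id and the call site here are placeholders, not my exact code):

```python
import os

# Ask NCCL to skip its shared-memory transport for intra-node communication.
# (Sketch of how the flag is passed in this setup; "model/id" is a placeholder.)
os.environ["NCCL_SHM_DISABLE"] = "1"
os.environ["NCCL_DEBUG"] = "INFO"  # produces the detailed NCCL log shown below

from vllm import LLM

# Tensor-parallel size 4 matches the log below.
llm = LLM(model="model/id", tensor_parallel_size=4, dtype="float16")
```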

DEBUG:     09-25 06:40:44 server.py:24] NCCL_SHM_DISABLE: 1
INFO 09-25 06:40:44 config.py:904] Defaulting to use mp for distributed inference
INFO 09-25 06:40:44 llm_engine.py:223] Initializing an LLM engine (v0.6.1.post2) with config: model='model/id', speculative_config=None, tokenizer='model/id', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=model/id, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
WARNING 09-25 06:40:44 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-25 06:40:44 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=675597) INFO 09-25 06:40:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=675599) INFO 09-25 06:40:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=675598) INFO 09-25 06:40:45 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 09-25 06:40:46 utils.py:981] Found nccl from library libnccl.so.2
INFO 09-25 06:40:46 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=675599) INFO 09-25 06:40:46 utils.py:981] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=675597) INFO 09-25 06:40:46 utils.py:981] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=675599) INFO 09-25 06:40:46 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=675598) INFO 09-25 06:40:46 utils.py:981] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=675597) INFO 09-25 06:40:46 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=675598) INFO 09-25 06:40:46 pynccl.py:63] vLLM is using nccl==2.20.5
pod-id:num:num [0] NCCL INFO Bootstrap : Using eth0:IP<0>
pod-id:num:num [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
pod-id:num:num [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.20.5+cuda12.4
pod-id:num:num [1] NCCL INFO cudaDriverVersion 12010
pod-id:num:num [3] NCCL INFO cudaDriverVersion 12010
pod-id:num:num [1] NCCL INFO Bootstrap : Using eth0:IP<0>
pod-id:num:num [2] NCCL INFO cudaDriverVersion 12010
pod-id:num:num [3] NCCL INFO Bootstrap : Using eth0:IP<0>
pod-id:num:num [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
pod-id:num:num [2] NCCL INFO Bootstrap : Using eth0:IP<0>
pod-id:num:num [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
pod-id:num:num [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
pod-id:num:num [2] NCCL INFO Failed to open libibverbs.so[.1]
pod-id:num:num [3] NCCL INFO Failed to open libibverbs.so[.1]
pod-id:num:num [2] NCCL INFO NET/Socket : Using [0]eth0:IP<0>
pod-id:num:num [3] NCCL INFO NET/Socket : Using [0]eth0:IP<0>
pod-id:num:num [2] NCCL INFO Using non-device net plugin version 0
pod-id:num:num [3] NCCL INFO Using non-device net plugin version 0
pod-id:num:num [2] NCCL INFO Using network Socket
pod-id:num:num [3] NCCL INFO Using network Socket
pod-id:num:num [1] NCCL INFO Failed to open libibverbs.so[.1]
pod-id:num:num [0] NCCL INFO Failed to open libibverbs.so[.1]
pod-id:num:num [1] NCCL INFO NET/Socket : Using [0]eth0:IP<0>
pod-id:num:num [0] NCCL INFO NET/Socket : Using [0]eth0:IP<0>
pod-id:num:num [1] NCCL INFO Using non-device net plugin version 0
pod-id:num:num [0] NCCL INFO Using non-device net plugin version 0
pod-id:num:num [1] NCCL INFO Using network Socket
pod-id:num:num [0] NCCL INFO Using network Socket
pod-id:num:num [0] NCCL INFO comm 0x56300ffd5e70 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 41000 commId 0x212dba036d0f1426 - Init START
pod-id:num:num [3] NCCL INFO comm 0x56300ffd3cf0 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId a1000 commId 0x212dba036d0f1426 - Init START
pod-id:num:num [1] NCCL INFO comm 0x56300ffd2cc0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 61000 commId 0x212dba036d0f1426 - Init START
pod-id:num:num [2] NCCL INFO comm 0x56300ffd2f90 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 81000 commId 0x212dba036d0f1426 - Init START
pod-id:num:num [3] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
pod-id:num:num [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
pod-id:num:num [2] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
pod-id:num:num [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
pod-id:num:num [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff,00000000
pod-id:num:num [3] NCCL INFO NVLS multicast support is not available on dev 3
pod-id:num:num [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
pod-id:num:num [1] NCCL INFO NVLS multicast support is not available on dev 1
pod-id:num:num [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff,00000000
pod-id:num:num [2] NCCL INFO NVLS multicast support is not available on dev 2
pod-id:num:num [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
pod-id:num:num [0] NCCL INFO NVLS multicast support is not available on dev 0
pod-id:num:num [3] NCCL INFO comm 0x56300ffd3cf0 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
pod-id:num:num [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
pod-id:num:num [3] NCCL INFO P2P Chunksize set to 131072
pod-id:num:num [2] NCCL INFO comm 0x56300ffd2f90 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
pod-id:num:num [0] NCCL INFO comm 0x56300ffd5e70 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
pod-id:num:num [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
pod-id:num:num [1] NCCL INFO comm 0x56300ffd2cc0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
pod-id:num:num [2] NCCL INFO P2P Chunksize set to 131072
pod-id:num:num [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
pod-id:num:num [0] NCCL INFO Channel 00/02 :    0   1   2   3
pod-id:num:num [0] NCCL INFO Channel 01/02 :    0   1   2   3
pod-id:num:num [1] NCCL INFO P2P Chunksize set to 131072
pod-id:num:num [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
pod-id:num:num [0] NCCL INFO P2P Chunksize set to 131072
pod-id:num:num [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/IPC
pod-id:num:num [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC
pod-id:num:num [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/IPC
pod-id:num:num [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC
pod-id:num:num [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
pod-id:num:num [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC
pod-id:num:num [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC
pod-id:num:num [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
pod-id:num:num [2] NCCL INFO Connected all rings
pod-id:num:num [1] NCCL INFO Connected all rings
pod-id:num:num [3] NCCL INFO Connected all rings
pod-id:num:num [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC
pod-id:num:num [0] NCCL INFO Connected all rings
pod-id:num:num [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC
pod-id:num:num [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC
pod-id:num:num [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC
pod-id:num:num [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC
pod-id:num:num [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC
pod-id:num:num [3] NCCL INFO Connected all trees
pod-id:num:num [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
pod-id:num:num [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
pod-id:num:num [0] NCCL INFO Connected all trees
pod-id:num:num [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
pod-id:num:num [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
pod-id:num:num [2] NCCL INFO Connected all trees
pod-id:num:num [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
pod-id:num:num [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
pod-id:num:num [1] NCCL INFO Connected all trees
pod-id:num:num [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
pod-id:num:num [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer

pod-id:num:num [1] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-i3YHLv to 5767524 bytes

pod-id:num:num [1] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-i3YHLv (size 5767520)

pod-id:num:num [2] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-LgIB3x to 5767524 bytes
pod-id:num:num [1] NCCL INFO proxy.cc:1252 -> 2

pod-id:num:num [0] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-YRwXLW to 5767524 bytes
pod-id:num:num [1] NCCL INFO proxy.cc:1315 -> 2

pod-id:num:num [2] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-LgIB3x (size 5767520)

pod-id:num:num [0] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-YRwXLW (size 5767520)
pod-id:num:num [2] NCCL INFO proxy.cc:1252 -> 2

pod-id:num:num [3] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-Jyxdc1 to 5767524 bytes
pod-id:num:num [1] NCCL INFO proxy.cc:1064 -> 2
pod-id:num:num [0] NCCL INFO proxy.cc:1252 -> 2
pod-id:num:num [2] NCCL INFO proxy.cc:1315 -> 2

pod-id:num:num [3] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-Jyxdc1 (size 5767520)
pod-id:num:num [0] NCCL INFO proxy.cc:1315 -> 2
pod-id:num:num [1] NCCL INFO init.cc:1328 -> 2
pod-id:num:num [3] NCCL INFO proxy.cc:1252 -> 2
pod-id:num:num [3] NCCL INFO proxy.cc:1315 -> 2
pod-id:num:num [1] NCCL INFO init.cc:1501 -> 2
pod-id:num:num [2] NCCL INFO proxy.cc:1064 -> 2
pod-id:num:num [0] NCCL INFO proxy.cc:1064 -> 2
pod-id:num:num [2] NCCL INFO init.cc:1328 -> 2
pod-id:num:num [1] NCCL INFO init.cc:1746 -> 2
pod-id:num:num [3] NCCL INFO proxy.cc:1064 -> 2
pod-id:num:num [0] NCCL INFO init.cc:1328 -> 2
pod-id:num:num [2] NCCL INFO init.cc:1501 -> 2
pod-id:num:num [3] NCCL INFO init.cc:1328 -> 2
pod-id:num:num [2] NCCL INFO init.cc:1746 -> 2
pod-id:num:num [3] NCCL INFO init.cc:1501 -> 2
pod-id:num:num [0] NCCL INFO init.cc:1501 -> 2
pod-id:num:num [3] NCCL INFO init.cc:1746 -> 2
pod-id:num:num [0] NCCL INFO init.cc:1746 -> 2
pod-id:num:num [2] NCCL INFO init.cc:1784 -> 2
pod-id:num:num [3] NCCL INFO init.cc:1784 -> 2
pod-id:num:num [1] NCCL INFO init.cc:1784 -> 2
pod-id:num:num [0] NCCL INFO init.cc:1784 -> 2
ERROR:     09-25 06:40:46 vllm_service.py:46] NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), Traceback (most recent call last):
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), Traceback (most recent call last):
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 176, in init_device
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 176, in init_device
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     init_worker_distributed_environment(self.parallel_config, self.rank,
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     init_worker_distributed_environment(self.parallel_config, self.rank,
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 451, in init_worker_distributed_environment
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 451, in init_worker_distributed_environment
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     initialize_model_parallel(tensor_model_parallel_size,
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     initialize_model_parallel(tensor_model_parallel_size,
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     _TP = init_model_parallel_group(group_ranks,
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     _TP = init_model_parallel_group(group_ranks,
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     return GroupCoordinator(
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     return GroupCoordinator(
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 154, in __init__
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 154, in __init__
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     self.pynccl_comm = PyNcclCommunicator(
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     self.pynccl_comm = PyNcclCommunicator(
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     raise RuntimeError(f"NCCL error: {error_str}")
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     raise RuntimeError(f"NCCL error: {error_str}")
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(VllmWorkerProcess pid=675598) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226] 
(VllmWorkerProcess pid=675599) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226] 
INFO:     127.0.0.1:40538 - "POST /vllm/load_model/ HTTP/1.1" 500 Internal Server Error
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method init_device: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), Traceback (most recent call last):
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 176, in init_device
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     init_worker_distributed_environment(self.parallel_config, self.rank,
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 451, in init_worker_distributed_environment
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     initialize_model_parallel(tensor_model_parallel_size,
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     _TP = init_model_parallel_group(group_ranks,
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     return GroupCoordinator(
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 154, in __init__
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     self.pynccl_comm = PyNcclCommunicator(
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]   File "/home/agentic/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226]     raise RuntimeError(f"NCCL error: {error_str}")
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
(VllmWorkerProcess pid=675597) ERROR 09-25 06:40:46 multiproc_worker_utils.py:226] 
INFO 09-25 06:40:46 multiproc_worker_utils.py:136] Terminating local vLLM worker processes
(VllmWorkerProcess pid=675598) INFO 09-25 06:40:46 multiproc_worker_utils.py:237] Worker exiting
(VllmWorkerProcess pid=675597) INFO 09-25 06:40:46 multiproc_worker_utils.py:237] Worker exiting
(VllmWorkerProcess pid=675599) INFO 09-25 06:40:46 multiproc_worker_utils.py:237] Worker exiting
Segmentation fault (core dumped)
sjeaugey commented 1 month ago

NCCL uses shared memory for many things, not just for intra-node communication (which NCCL_SHM_DISABLE=1 affects). You should probably fix your environment to allow the creation of some shared memory.
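One way to sanity-check the environment is to look at how much space the /dev/shm mount actually has before launching. A minimal sketch in Python; the 1 GiB threshold is only an assumed comfortable margin for a 4-GPU tensor-parallel run, not an NCCL requirement:

```python
import shutil

# NCCL was trying to create ~5.7 MB segments in /dev/shm (see the
# shmutils.cc warnings above). Check how much space the mount really has.
usage = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total={usage.total / 2**20:.1f} MiB, "
      f"free={usage.free / 2**20:.1f} MiB")

# Assumption: 1 GiB free is a comfortable lower bound here.
if usage.free < 1 * 2**30:
    print("Warning: /dev/shm looks too small; increase it, e.g. with "
          "`docker run --shm-size=8g ...` or a Kubernetes emptyDir "
          "(medium: Memory) mounted at /dev/shm.")
```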