My k8s cluster has the NVIDIA GPU Operator installed, and I can deploy a model that uses a single GPU without problems. But I get an error when trying to deploy a model across 4 GPUs on one node. According to the log, NCCL P2P cannot be enabled inside the k8s pod, even though the GPUs are on the same node. Is there a way to enable it?
Here is my k8s deployment file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  namespace: model
spec:
  replicas: 1 # You can scale this up to 10
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm-container
          image: vllm/vllm-openai:latest
          env:
            - name: NCCL_DEBUG
              value: INFO
          args: ["--model", "Qwen/Qwen2-72B-Instruct-GPTQ-Int4",
                 "--download-dir", "/models",
                 "--served-model-name", "qwen2-72b",
                 "--kv-cache-dtype", "fp8_e4m3",
                 "--tensor-parallel-size", "4",
                 "--gpu-memory-utilization", "0.8",
                 "--max-model-len", "14336",
                 "--port", "11434"]
          # readiness check: hit the /health endpoint on port 11434
          readinessProbe:
            httpGet:
              path: /health
              port: 11434
            initialDelaySeconds: 5
            periodSeconds: 1
          ports:
            - containerPort: 11434
              name: vllm-port
          resources:
            requests:
              memory: '5000Mi'
              nvidia.com/gpu: 4
            limits:
              memory: '5000Mi'
              nvidia.com/gpu: 4
          volumeMounts:
            - mountPath: /models
              name: model-storage
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage-pvc
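One thing I noticed while writing this up: the manifest declares a dshm emptyDir volume but never mounts it at /dev/shm, and from what I've read NCCL uses shared memory for intra-node communication when GPU P2P isn't available. This is a sketch of the change I'm considering, in case it's relevant (the sizeLimit value is just my guess):

          volumeMounts:
            - mountPath: /models
              name: model-storage
            # mount the shared-memory volume that is already declared, so NCCL's SHM transport has room
            - mountPath: /dev/shm
              name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi # guessed size; not sure what NCCL/vLLM actually needs
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-storage-pvc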
Here is the error log:
2024-08-13T07:37:27.223526999Z vllm-deployment-8577b94b74-fhx85:61:61 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
2024-08-13T07:37:27.401053965Z vllm-deployment-8577b94b74-fhx85:61:61 [ERROR 08-13 07:37:27 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 111 died, exit code: -15
2024-08-13T07:37:27.401078075Z INFO 08-13 07:37:27 multiproc_worker_utils.py:123] Killing local vLLM worker processes
2024-08-13T07:37:27.506937412Z Process Process-1:
2024-08-13T07:37:27.508190691Z Traceback (most recent call last):
2024-08-13T07:37:27.508231292Z File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap self.run()
2024-08-13T07:37:27.508242732Z File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
2024-08-13T07:37:27.508246402Z self._target(*self._args, **self._kwargs)
2024-08-13T07:37:27.508249772Z File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
2024-08-13T07:37:27.508253072Z server = AsyncEngineRPCServer(async_engine_args, usage_context, port) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
2024-08-13T07:37:27.508259922Z self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
2024-08-13T07:37:27.508263112Z File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
2024-08-13T07:37:27.508266912Z engine = cls(
2024-08-13T07:37:27.508270963Z File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
2024-08-13T07:37:27.508275343Z self.engine = self._init_engine(*args, **kwargs)
2024-08-13T07:37:27.508279263Z File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
2024-08-13T07:37:27.508283243Z return engine_class(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 249, in __init__
2024-08-13T07:37:27.508291193Z self.model_executor = executor_class(
2024-08-13T07:37:27.508295083Z File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
2024-08-13T07:37:27.508298743Z super().__init__(*args, **kwargs)
2024-08-13T07:37:27.508301943Z File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__ super().__init__(*args, **kwargs)
2024-08-13T07:37:27.508308343Z File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
2024-08-13T07:37:27.508311943Z self._init_executor()
2024-08-13T07:37:27.508315093Z File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor self._run_workers("init_device")
2024-08-13T07:37:27.508321623Z File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
2024-08-13T07:37:27.508324823Z driver_worker_output = driver_worker_method(*args, **kwargs)
2024-08-13T07:37:27.508328293Z File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device init_worker_distributed_environment(self.parallel_config, self.rank,
2024-08-13T07:37:27.508334623Z File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 348, in init_worker_distributed_environment
2024-08-13T07:37:27.508337834Z ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
2024-08-13T07:37:27.508341424Z File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
2024-08-13T07:37:27.508344584Z initialize_model_parallel(tensor_model_parallel_size,
2024-08-13T07:37:27.508347804Z File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
2024-08-13T07:37:27.508350984Z _TP = init_model_parallel_group(group_ranks,
2024-08-13T07:37:27.508370514Z File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
2024-08-13T07:37:27.508386234Z return GroupCoordinator(
2024-08-13T07:37:27.508389804Z File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 154, in __init__
2024-08-13T07:37:27.508393004Z self.pynccl_comm = PyNcclCommunicator( File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
2024-08-13T07:37:27.508399564Z self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
2024-08-13T07:37:27.508402775Z File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
2024-08-13T07:37:27.508405955Z self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
2024-08-13T07:37:27.508409115Z File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
2024-08-13T07:37:27.508412315Z raise RuntimeError(f"NCCL error: {error_str}")
2024-08-13T07:37:27.508415555Z RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
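In case it's useful, this is the kind of NCCL environment override I'm thinking about adding to the container, based on the hint in the log (as far as I understand, NCCL_IGNORE_DISABLED_P2P only suppresses the message, and NCCL_P2P_DISABLE makes NCCL avoid P2P entirely rather than enable it):

          env:
            - name: NCCL_DEBUG
              value: INFO
            # from the hint in the log: only silences the warning, does not enable P2P
            - name: NCCL_IGNORE_DISABLED_P2P
              value: "1"
            # fall back to shared-memory/socket transports instead of GPU P2P (assuming the performance hit is acceptable)
            - name: NCCL_P2P_DISABLE
              value: "1"

Is something along these lines the right direction, or is there a way to actually enable P2P between the GPUs inside the pod?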