HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: RuntimeError: synStatus=26 [Generic failure] Device acquire failed. #344

Closed ligjn closed 4 days ago

ligjn commented 6 days ago

Your current environment

PyTorch version: 2.3.1a0+git4989238
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-122-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8457C
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 8
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
BogoMIPS: 5200.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 4.5 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 192 MiB (96 instances)
L3 cache: 195 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] habana-torch-dataloader==1.17.0.495
[pip3] habana-torch-plugin==1.17.0.495
[pip3] numpy==1.26.4
[pip3] nvidia-ml-py==12.560.30
[pip3] pynvml==8.0.4
[pip3] pytorch-lightning==2.3.3
[pip3] pyzmq==26.2.0
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.3.1a0+git4989238
[pip3] torch_tb_profiler==0.4.0
[pip3] torchaudio==2.3.0+952ea74
[pip3] torchdata==0.7.1+5e6f7b7
[pip3] torchmetrics==1.4.0.post0
[pip3] torchtext==0.18.0a0+9bed85d
[pip3] torchvision==0.18.1a0+fe70bc8
[pip3] transformers==4.43.4
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1@bc39baa482dcfefeae6289e80cea63b4adc9beeb
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology: Could not collect

Model Input Dumps

yWorkerWrapper pid=2593) /usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
(RayWorkerWrapper pid=2593) warnings.warn(
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 profiler.py:62] Profiler enabled for: vllm-instance-481e316c161e41a5b6fe556e0bc6e8d5
(RayWorkerWrapper pid=2593) WARNING 09-27 09:41:52 utils.py:597] Pin memory is not supported on HPU.
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 selector.py:85] Using HabanaAttention backend.
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_PROMPT_BS_BUCKET_MAX=64 (default:64)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_DECODE_BS_BUCKET_MIN=32 (default:32)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:96] VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:693] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
(RayWorkerWrapper pid=2593) INFO 09-27 09:41:52 habana_model_runner.py:698] Decode bucket config (min, step, max_warmup) bs:[32, 32, 256], block:[128, 128, 4096]
(pid=3257) WARNING 09-27 09:41:51 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") [repeated 3x across cluster]
ERROR 09-27 09:41:52 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 09-27 09:41:52 worker_base.py:382] Traceback (most recent call last):
ERROR 09-27 09:41:52 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
ERROR 09-27 09:41:52 worker_base.py:382] return executor(args, kwargs)
ERROR 09-27 09:41:52 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/habana_worker.py", line 108, in init_device
ERROR 09-27 09:41:52 worker_base.py:382] torch.hpu.set_device(self.device)
ERROR 09-27 09:41:52 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/init.py", line 309, in set_device
ERROR 09-27 09:41:52 worker_base.py:382] device_idx = _get_device_index(device, optional=True)
ERROR 09-27 09:41:52 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/_utils.py", line 40, in _get_device_index
ERROR 09-27 09:41:52 worker_base.py:382] device_idx = hpu.current_device()
ERROR 09-27 09:41:52 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/init.py", line 133, in current_device
ERROR 09-27 09:41:52 worker_base.py:382] init()
ERROR 09-27 09:41:52 worker_base.py:382] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/init.py", line 72, in init
ERROR 09-27 09:41:52 worker_base.py:382] _hpu_C.init()
ERROR 09-27 09:41:52 worker_base.py:382] RuntimeError: synStatus=26 [Generic failure] Device acquire failed.
Traceback (most recent call last):
  File "/app/llm-finetune-szw1_dev/model/Smart-Vision-Model-Server/src/api_demo.py", line 41, in
    main()
  File "/app/llm-finetune-szw1_dev/model/Smart-Vision-Model-Server/src/api_demo.py", line 25, in main
    chat_model = ChatModel()
  File "/app/llm-finetune-szw1_dev/model/Smart-Vision-Model-Server/src/llmtuner/chat/chat_model.py", line 45, in init
    self.engine: "BaseEngine" = VllmEngine(model_args, data_args, finetuning_args, generating_args)
  File "/app/llm-finetune-szw1_dev/model/Smart-Vision-Model-Server/src/llmtuner/chat/vllm_engine.py", line 100, in init
    self.model = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(engine_args))
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 479, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in init
    self.engine = self._init_engine(args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 560, in _init_engine
    return engine_class(*args, *kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 252, in init
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_habana_executor.py", line 382, in init
    super().init(args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in init
    super().init(*args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in init
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_habana_executor.py", line 64, in _init_executor
    self._init_workers_ray(placement_group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_habana_executor.py", line 206, in _init_workers_ray
    self._run_workers("init_device")
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_habana_executor.py", line 326, in _run_workers
    self.driver_worker.execute_method(method, driver_args,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 383, in execute_method
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
    return executor(args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/habana_worker.py", line 108, in init_device
    torch.hpu.set_device(self.device)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/init.py", line 309, in set_device
    device_idx = _get_device_index(device, optional=True)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/_utils.py", line 40, in _get_device_index
    device_idx = hpu.current_device()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/init.py", line 133, in current_device
    init()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/init.py", line 72, in init
    _hpu_C.init()
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

🐛 Describe the bug

On k8s 1.22.0, model inference with HF starts normally, but inference with vLLM fails with the following error: RuntimeError: synStatus=26 [Generic failure] Device acquire failed.

Running vLLM inference in plain Docker also works. Both the k8s and the Docker container configurations follow the instructions on the official Habana website, and the two are equivalent in meaning, yet the error occurs only on k8s. Why is this?
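Since the two setups are meant to be equivalent, one way to check that claim is to capture what each container actually sees and diff the results. The following is only a sketch under stated assumptions: the variable prefixes (`HABANA`, `HLS`, `PT_HPU`, `VLLM`, `RAY`) and the device-node patterns (`/dev/accel/*`, `/dev/hl*`) are guesses about what is relevant, not a list taken from this issue or from the Habana documentation, and the helper names are hypothetical.

```python
import glob
import json
import os

# Assumed prefixes of interest; adjust to whatever the deployment really sets.
PREFIXES = ("HABANA", "HLS", "PT_HPU", "VLLM", "RAY")
# Assumed device-node patterns; actual names depend on the driver version.
DEVICE_GLOBS = ("/dev/accel/*", "/dev/hl*")


def snapshot(environ=None, prefixes=PREFIXES, device_globs=DEVICE_GLOBS):
    """Collect matching environment variables and visible device nodes."""
    environ = os.environ if environ is None else environ
    env = {k: v for k, v in environ.items() if k.startswith(prefixes)}
    devices = sorted(p for pattern in device_globs for p in glob.glob(pattern))
    return {"env": env, "devices": devices}


def diff_snapshots(a, b):
    """Return only the entries that differ between two snapshots."""
    env_diff = {}
    for key in sorted(set(a["env"]) | set(b["env"])):
        left, right = a["env"].get(key), b["env"].get(key)
        if left != right:
            env_diff[key] = (left, right)
    return {
        "env": env_diff,
        "devices_only_in_a": sorted(set(a["devices"]) - set(b["devices"])),
        "devices_only_in_b": sorted(set(b["devices"]) - set(a["devices"])),
    }


if __name__ == "__main__":
    # Run once in the Docker container and once in the k8s pod,
    # save both outputs, then compare them with diff_snapshots().
    print(json.dumps(snapshot(), indent=2, sort_keys=True))
```

Running the script in both containers and feeding the two saved JSON objects to `diff_snapshots` would show whether the pod is missing a device node or has a different visibility setting; an empty diff would suggest looking elsewhere, for example at the driver logs.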


michalkuligowski commented 6 days ago

Hi, could you provide logs as described in https://docs.habana.ai/en/latest/PyTorch/Reference/Debugging_Guide/Debugging_with_Intel_Gaudi_Logs.html? Errors and warnings from dmesg and synapse_runtime.log will probably be enough to debug this.
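To keep the attached logs short, the error and warning lines can be filtered out before posting. This is a rough sketch, not an official tool: the keyword list is an assumption about what counts as relevant, `problem_lines` is a hypothetical helper, and the file paths are passed on the command line because the actual location of synapse_runtime.log depends on the logging configuration.

```python
import re
import sys

# Assumed keywords; widen or narrow the pattern as needed.
PROBLEM = re.compile(r"\b(error|warn(ing)?|fail(ed|ure)?|critical)\b", re.IGNORECASE)


def problem_lines(text, context=0):
    """Return lines matching PROBLEM, plus `context` lines around each match."""
    lines = text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if PROBLEM.search(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    return [lines[i] for i in sorted(keep)]


if __name__ == "__main__":
    # Each argument is a log file, e.g. a saved dmesg dump or synapse_runtime.log.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for line in problem_lines(fh.read()):
                print(f"{path}: {line}")
```

For example, after saving the kernel log with `dmesg > dmesg.txt`, the script could be run as `python filter_logs.py dmesg.txt synapse_runtime.log`, where `filter_logs.py` is whatever name the file was saved under.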