HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: `--enable-lora` raises error while trying to start api_server #405

Open JHLEE17 opened 1 day ago

JHLEE17 commented 1 day ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
/home/irteamsu/miniconda3/envs/jongho/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
PyTorch version: 2.3.1a0+git4989238
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (conda-forge gcc 12.1.0-17) 12.1.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.35
Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 40
Socket(s): 2
Stepping: 6
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 4600.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
L1d cache: 3.8 MiB (80 instances)
L1i cache: 2.5 MiB (80 instances)
L2 cache: 100 MiB (80 instances)
L3 cache: 120 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-39,80-119
NUMA node1 CPU(s): 40-79,120-159
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] habana-torch-dataloader==1.17.0.495
[pip3] habana-torch-plugin==1.17.0.495
[pip3] numpy==1.26.4
[pip3] pynvml==8.0.4
[pip3] pytorch-lightning==2.4.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.3.1a0+git4989238
[pip3] torch_tb_profiler==0.4.0
[pip3] torchaudio==2.3.0+952ea74
[pip3] torchdata==0.7.1+5e6f7b7
[pip3] torchmetrics==1.4.1
[pip3] torchtext==0.18.0a0+9bed85d
[pip3] torchvision==0.18.1a0+fe70bc8
[pip3] transformers==4.45.2
[pip3] triton==3.1.0
[conda] habana-torch-dataloader 1.17.0.495 pypi_0 pypi
[conda] habana-torch-plugin 1.17.0.495 pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] pynvml 8.0.4 pypi_0 pypi
[conda] pytorch-lightning 2.4.0 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.3.1a0+git4989238 pypi_0 pypi
[conda] torch-tb-profiler 0.4.0 pypi_0 pypi
[conda] torchaudio 2.3.0+952ea74 pypi_0 pypi
[conda] torchdata 0.7.1+5e6f7b7 pypi_0 pypi
[conda] torchmetrics 1.4.1 pypi_0 pypi
[conda] torchtext 0.18.0a0+9bed85d pypi_0 pypi
[conda] torchvision 0.18.1a0+fe70bc8 pypi_0 pypi
[conda] transformers 4.45.2 pypi_0 pypi
[conda] triton 3.1.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.dev553+g9276ccca
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology: Could not collect
```

Model Input Dumps

No response

🐛 Describe the bug

I encountered an error while trying to start the api_server with Multi-LoRA.

The command I used is as follows (the error can be reproduced even without the last three lines of the command):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model /home/irteamsu/models/Meta-Llama-3.1-8B-Instruct \
    --block-size 128 \
    --max-model-len 2048 \
    --disable-log-requests \
    --enable-lora \
    --max-loras 2 \
    --max-lora-rank 8 \
    --lora-modules lora-1=/home/irteamsu/models/Gaudi_LoRA_Llama-3-8B-Instruct lora-2=/home/irteamsu/models/Gaudi_LoRA_Llama-3-8B-Instruct
```
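The adapter names registered via `--lora-modules` (`lora-1`, `lora-2`) are what a client would pass in the `model` field of the OpenAI-compatible API once the server is up. Below is a minimal request sketch, assuming the default port 8000 and the `requests` package:

```python
# Minimal sketch: select one of the LoRA adapters registered via --lora-modules.
# Assumes the server above came up on the default port 8000 and that the
# `requests` package is installed.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "lora-1",            # adapter name from --lora-modules
        "prompt": "Hello, my name is",
        "max_tokens": 32,
    },
)
print(resp.json())
```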

However, when I run this command, the following error occurs:

Error logs:

```bash
~/habanaAI/vllm-fork$ python -m vllm.entrypoints.openai.api_server --model /home/irteamsu/models/Meta-Llama-3-8B-Instruct --block-size 128 --max-model-len 2048 --disable-log-requests --enable-lora --max-loras 2 --max-lora-rank 8 --lora-modules lora-1=/home/irteamsu/models/Gaudi_LoRA_Llama-3-8B-Instruct lora-2=/home/irteamsu/models/Gaudi_LoRA_Llama-3-8B-Instruct --port 8001
/home/irteamsu/miniconda3/envs/jongho/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
INFO 10-18 16:07:06 api_server.py:527] vLLM API server version 0.6.3.dev553+g9276ccca
INFO 10-18 16:07:06 api_server.py:528] args: Namespace(host=None, port=8001, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='lora-1', path='/home/irteamsu/models/Gaudi_LoRA_Llama-3-8B-Instruct', base_model_name=None), LoRAModulePath(name='lora-2', path='/home/irteamsu/models/Gaudi_LoRA_Llama-3-8B-Instruct', base_model_name=None)], prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/home/irteamsu/models/Meta-Llama-3-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', weights_load_device=None, config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=128, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=True, max_loras=2, max_lora_rank=8, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=True, max_log_len=None, disable_fastapi_docs=False)
INFO 10-18 16:07:06 api_server.py:165] Multiprocessing frontend to use ipc:///tmp/73468b3b-0de1-4f35-a284-2c7025932957 for IPC Path.
INFO 10-18 16:07:06 api_server.py:178] Started engine process with PID 954103
/home/irteamsu/miniconda3/envs/jongho/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
INFO 10-18 16:07:12 llm_engine.py:238] Initializing an LLM engine (v0.6.3.dev553+g9276ccca) with config: model='/home/irteamsu/models/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='/home/irteamsu/models/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, weights_load_device=hpu, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/irteamsu/models/Meta-Llama-3-8B-Instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 10-18 16:07:12 utils.py:794] Pin memory is not supported on HPU.
INFO 10-18 16:07:12 selector.py:147] Using HPUAttention backend.
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_PROMPT_BS_BUCKET_MAX=64 (default:64)
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
INFO 10-18 16:07:12 hpu_model_runner.py:98] VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
INFO 10-18 16:07:12 hpu_model_runner.py:706] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 10-18 16:07:12 hpu_model_runner.py:711] Decode bucket config (min, step, max_warmup) bs:[1, 32, 256], block:[128, 128, 4096]
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 2113407780 KB
------------------------------------------------------------------------------
INFO 10-18 16:07:16 selector.py:147] Using HPUAttention backend.
INFO 10-18 16:07:16 loader.py:405] Loading weights on hpu...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00
```

The issue occurs with both commits 9276ccc (habana_main) and d6bd375 (remove-lora-warmup-constraints). I've verified that the paths to the models are correct and that the models are accessible.
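In case it helps isolate the problem from the API server, here is a minimal offline sketch that exercises the same LoRA settings (a sketch only, assuming the standard `LLM`/`LoRARequest` interfaces and the same model/adapter paths as in the command above):

```python
# Sketch only: exercise the same LoRA configuration without the OpenAI API server.
# Assumes the standard vLLM offline interface (LLM + LoRARequest) and the same
# model/adapter paths as in the command above.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="/home/irteamsu/models/Meta-Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_loras=2,
    max_lora_rank=8,
    max_model_len=2048,
    block_size=128,
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
    lora_request=LoRARequest(
        "lora-1",  # adapter name
        1,         # unique integer id for the adapter
        "/home/irteamsu/models/Gaudi_LoRA_Llama-3-8B-Instruct",
    ),
)
print(outputs[0].outputs[0].text)
```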

Any guidance on resolving this issue would be greatly appreciated.

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
vivekgoe commented 1 day ago

@JHLEE17 Thanks for raising this issue; we will look into it right away. Can you please share which SynapseAI release you are using to run this test?

JHLEE17 commented 1 day ago

I ran this on SynapseAI 1.17, but I got a similar error with version 1.18 as well.
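For completeness, a quick way to double-check the installed stack (a sketch using `importlib.metadata`; the package names are the ones from the `collect_env.py` output above, and `hl-smi` can additionally be used to confirm the driver version):

```python
# Sketch: print the versions of the Habana/vLLM packages reported by
# collect_env.py above, to confirm which SynapseAI release is installed.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("habana-torch-plugin", "habana-torch-dataloader", "torch", "vllm"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```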

vivekgoe commented 1 day ago

We seem to have a backward-compatibility issue: https://github.com/HabanaAI/vllm-fork/pull/382 works with the latest SynapseAI code (not yet released) but throws the above error with SynapseAI 1.18.0. We will work on fixing this and get back ASAP.
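Purely as an illustration of what such a compatibility guard can look like (a sketch, not the planned fix; the package name comes from the environment above and the `1.19.0` threshold is an assumption, not the real compatibility boundary):

```python
# Illustrative sketch only: gate a code path on the installed
# habana-torch-plugin version. The 1.19.0 threshold is an assumption,
# not the actual compatibility boundary.
from importlib.metadata import version
from packaging.version import Version

plugin_version = Version(version("habana-torch-plugin"))

# Use the newer LoRA path only when the installed SynapseAI stack is recent
# enough; otherwise fall back to the previous behavior.
USE_NEW_LORA_PATH = plugin_version >= Version("1.19.0")
```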