HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: habana_main - Qwen2-7B fails with graph compile error. #232

Closed Zjq9409 closed 2 weeks ago

Zjq9409 commented 2 months ago

Your current environment

driver 1.17, vllm 0.5.3.post1+gaudi117

export VLLM_GRAPH_RESERVED_MEM=0.1
export VLLM_GRAPH_PROMPT_RATIO=0.9
export VLLM_PROMPT_SEQ_BUCKET_MIN=2048
export VLLM_PROMPT_SEQ_BUCKET_STEP=2048
export VLLM_PROMPT_SEQ_BUCKET_MAX=2048

export VLLM_PROMPT_BS_BUCKET_MIN=100
export VLLM_PROMPT_BS_BUCKET_STEP=100
export VLLM_PROMPT_BS_BUCKET_MAX=100

export VLLM_DECODE_SEQ_BUCKET_MIN=2048
export VLLM_DECODE_SEQ_BUCKET_STEP=2048
export VLLM_DECODE_SEQ_BUCKET_MAX=2560
export VLLM_DECODE_BS_BUCKET_MIN=100
export VLLM_DECODE_BS_BUCKET_STEP=100
export VLLM_DECODE_BS_BUCKET_MAX=100
export VLLM_ENGINE_ITERATION_TIMEOUT_S=100

export PT_HPU_RECIPE_CACHE_CONFIG=./cached_recipes,false,5000
export VLLM_PROMPT_USE_FUSEDSDPA=1

python -m vllm.entrypoints.openai.api_server \
  --model /data/Qwen2-7B-Instruct/ \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --block-size 128 \
  --dtype bfloat16 \
  --max-num-seqs 128 \
  --gpu-memory-utilization 0.9 \
  --disable-log-requests \
  --host 0.0.0.0 \
  --port 8111
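
For isolating the failure from the API-server layer, the same engine settings can also be exercised offline. This is a minimal sketch using the model path, dtype, and block size from the command above; it is not part of the original report:

from vllm import LLM, SamplingParams

# Same model path, dtype and block size as the serving command; engine warmup
# and profiling go through the same HPU graph-compile path as the server does.
llm = LLM(model="/data/Qwen2-7B-Instruct/",
          dtype="bfloat16",
          block_size=128,
          max_num_seqs=128,
          trust_remote_code=True)

out = llm.generate(["你是谁?"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)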

🐛 Describe the bug

INFO 09-03 07:45:49 habana_model_runner.py:486] Loading model weights took in total 14.21 GiB of device memory (14.22 GiB/93.55 GiB used) and 1.984 GiB of host memory (48.26 GiB/1007 GiB used)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 312, in <module>
[rank0]:     asyncio.run(run_server(args))
[rank0]:   File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
[rank0]:     return loop.run_until_complete(main)
[rank0]:   File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank0]:     return future.result()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 289, in run_server
[rank0]:     app = await init_app(args, llm_engine)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 229, in init_app
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/engine/async_llm_engine.py", line 479, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/engine/async_llm_engine.py", line 560, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/engine/llm_engine.py", line 266, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/engine/llm_engine.py", line 365, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/executor/habana_executor.py", line 77, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/worker/habana_worker.py", line 142, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/worker/habana_model_runner.py", line 1090, in profile_run
[rank0]:     self.warmup_scenario(max_batch_size,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/worker/habana_model_runner.py", line 1152, in warmup_scenario
[rank0]:     self.execute_model(inputs, kv_caches, warmup_mode=True)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/worker/habana_model_runner.py", line 1612, in execute_model
[rank0]:     output = self.model.sample(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/worker/habana_model_runner.py", line 205, in sample
[rank0]:     return self.model.sample(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/model_executor/models/qwen2.py", line 359, in sample
[rank0]:     next_tokens = self.sampler(logits, sampling_metadata)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/model_executor/layers/sampler.py", line 138, in forward
[rank0]:     sample_results, maybe_sampled_tokens_tensor = _sample(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/model_executor/layers/sampler.py", line 712, in _sample
[rank0]:     return _sample_with_torch(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/model_executor/layers/sampler.py", line 593, in _sample_with_torch
[rank0]:     sample_results = _greedy_sample(seq_groups, greedy_samples)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+gaudi117-py3.10.egg/vllm/model_executor/layers/sampler.py", line 336, in _greedy_sample
[rank0]:     print("jane samples :", samples)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 474, in __repr__
[rank0]:     return torch._tensor_str._str(self, tensor_contents=tensor_contents)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 697, in _str
[rank0]:     return _str_intern(self, tensor_contents=tensor_contents)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 617, in _str_intern
[rank0]:     tensor_str = _tensor_str(self, indent)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 349, in _tensor_str
[rank0]:     formatter = _Formatter(get_summarized_data(self) if summarize else self)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor_str.py", line 133, in __init__
[rank0]:     value_str = f"{value}"
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 1000, in __format__
[rank0]:     return self.item().__format__(format_spec)
[rank0]: RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure]. 
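
Note: the traceback shows the error surfacing from a print of the samples tensor in sampler.py. On Gaudi, reading or printing a lazy HPU tensor forces the accumulated graph to compile and execute, so the synStatus 26 failure comes from that compiled graph rather than from the print itself. A minimal sketch of this behavior (assumes the Habana PyTorch bridge from the 1.17 stack; not code from this issue):

import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

x = torch.ones(4, device="hpu") * 2  # in lazy mode this op is only queued, not run yet
print(x)  # printing materializes the tensor: the queued graph is compiled and
          # executed here, so any graph-compile failure surfaces inside print()
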
xuechendi commented 2 months ago

@Zjq9409, what test message did you send? I tested with the exact same branch (v0.5.3.post1-Gaudi-1.17.0) with the same env vars and serving settings, and it went through successfully.


from openai import OpenAI

if __name__ == "__main__":

    model = "Qwen/Qwen2-7B-Instruct"
    #model = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    llm = OpenAI(base_url="http://100.83.111.250:8000/v1", api_key="EMPTY")

    input = [{"role": "user", "content": "你是谁?"}]

    output = llm.chat.completions.create(
        model=model,
        messages=input,
        stream=True,
        max_tokens=128,
    )

    for chunk in output:
        if cont := chunk.choices[0].delta.content:
            print(cont, end='', flush=True)

    print()
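
As a side check, the server in the report was launched with --model /data/Qwen2-7B-Instruct/ (a local path) on port 8111, while this client uses the hub id and port 8000; the model id the server actually registered can be listed from the same client. A small sketch using the OpenAI client's standard models endpoint, with the base_url taken from the script above:

from openai import OpenAI

client = OpenAI(base_url="http://100.83.111.250:8000/v1", api_key="EMPTY")

# The OpenAI-compatible vLLM server reports the model id it was launched with
# (e.g. the local path when --model points at a directory); the "model" field
# in chat requests has to match one of these ids.
for m in client.models.list().data:
    print(m.id)
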
xuechendi commented 2 months ago

@Zjq9409, I didn't see "VLLM_PROMPT_USE_FUSEDSDPA" enabled in "v0.5.3.post1-Gaudi-1.17.0". If you would like to test with FusedSDPA, you may need to switch to the "habana_main" branch.

Zjq9409 commented 2 months ago

serving

I tested it using habana_main

Zjq9409 commented 2 months ago

> @Zjq9409, what test message did you send? I tested with the exact same branch (v0.5.3.post1-Gaudi-1.17.0) with the same env vars and serving settings, and it went through successfully. [screenshots and test script quoted above]

Which branch are you using?

xuechendi commented 2 months ago

> serving
>
> I tested it using habana_main

According to the log you provided, you're running 'vllm-0.5.3.post1+gaudi117', which is behind habana_main. Also, are you testing on G2D or G2H?

BTW, I tested both habana_main and the exact same branch (vllm-0.5.3.post1+gaudi117). Both work OK with Qwen2-7B using the same configuration on G2H.
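
A quick way to confirm which build is actually loaded in the serving environment is to check the installed package directly. This is a generic sketch; the git step assumes a source checkout of vllm-fork at a placeholder path:

import subprocess
import vllm

print(vllm.__version__)  # e.g. 0.5.3.post1+gaudi117 for the Gaudi release branch
print(vllm.__file__)     # shows which install (egg / site-packages) is being imported

# If the package was installed from a source checkout, read the branch there
# ("/path/to/vllm-fork" is a placeholder for the actual checkout directory):
branch = subprocess.run(["git", "-C", "/path/to/vllm-fork", "rev-parse", "--abbrev-ref", "HEAD"],
                        capture_output=True, text=True).stdout.strip()
print(branch)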

michalkuligowski commented 1 month ago

Hi @Zjq9409, do you still observe the issue, or can it be closed?

michalkuligowski commented 2 weeks ago

Closing due to no update from the author; please reopen if the issue occurs on the latest version.