HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Habana_main does not support DBRX and Arctic due to cuda hardcode #216

Closed: xuechendi closed this 2 days ago

xuechendi commented 2 weeks ago

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

Error log:

AssertionError: Torch not compiled with CUDA enabled
 - databricks/dbrx-instruct failed!
Error info: Torch not compiled with CUDA enabled
Traceback (most recent call last):
  File "/workspace/script/test_llm_generate_modellist.py", line 51, in <module>
    output = test_llm_model(model=model)
  File "/workspace/script/test_llm_generate_modellist.py", line 16, in test_llm_model
    llm = LLM(model=model,
  File "/workspace/vllm/vllm/entrypoints/llm.py", line 155, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/workspace/vllm/vllm/engine/llm_engine.py", line 456, in from_engine_args
    engine = cls(
  File "/workspace/vllm/vllm/engine/llm_engine.py", line 252, in __init__
    self.model_executor = executor_class(
  File "/workspace/vllm/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/workspace/vllm/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/workspace/vllm/vllm/executor/ray_habana_executor.py", line 64, in _init_executor
    self._init_workers_ray(placement_group)
  File "/workspace/vllm/vllm/executor/ray_habana_executor.py", line 205, in _init_workers_ray
    self._run_workers("load_model",
  File "/workspace/vllm/vllm/executor/ray_habana_executor.py", line 324, in _run_workers
    self.driver_worker.execute_method(method, *driver_args,
  File "/workspace/vllm/vllm/worker/worker_base.py", line 383, in execute_method
    raise e
  File "/workspace/vllm/vllm/worker/worker_base.py", line 374, in execute_method
    return executor(*args, **kwargs)
  File "/workspace/vllm/vllm/worker/habana_worker.py", line 121, in load_model
    self.model_runner.load_model()
  File "/workspace/vllm/vllm/worker/habana_model_runner.py", line 460, in load_model
    self.model = get_model(
  File "/workspace/vllm/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/workspace/vllm/vllm/model_executor/model_loader/loader.py", line 281, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/workspace/vllm/vllm/model_executor/model_loader/loader.py", line 112, in _initialize_model
    return model_class(config=model_config.hf_config,
  File "/workspace/vllm/vllm/model_executor/models/dbrx.py", line 367, in __init__
    self.transformer = DbrxModel(config, cache_config, quant_config)
  File "/workspace/vllm/vllm/model_executor/models/dbrx.py", line 324, in __init__
    self.blocks = nn.ModuleList([
  File "/workspace/vllm/vllm/model_executor/models/dbrx.py", line 325, in <listcomp>
    DbrxBlock(config, cache_config, quant_config)
  File "/workspace/vllm/vllm/model_executor/models/dbrx.py", line 291, in __init__
    self.ffn = DbrxExperts(config, quant_config)
  File "/workspace/vllm/vllm/model_executor/models/dbrx.py", line 85, in __init__
    torch.empty(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 78, in __torch_function__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
INFO 08-29 00:49:03 selector.py:85] Using HabanaAttention backend.
 - Snowflake/snowflake-arctic-instruct failed!
Error info: Torch not compiled with CUDA enabled
Traceback (most recent call last):
  File "/workspace/script/test_llm_generate_modellist.py", line 51, in <module>
    output = test_llm_model(model=model)
  File "/workspace/script/test_llm_generate_modellist.py", line 16, in test_llm_model
    llm = LLM(model=model,
  File "/workspace/vllm/vllm/entrypoints/llm.py", line 155, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/workspace/vllm/vllm/engine/llm_engine.py", line 456, in from_engine_args
    engine = cls(
  File "/workspace/vllm/vllm/engine/llm_engine.py", line 252, in __init__
    self.model_executor = executor_class(
  File "/workspace/vllm/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/workspace/vllm/vllm/executor/habana_executor.py", line 27, in _init_executor
    self._init_worker()
  File "/workspace/vllm/vllm/executor/habana_executor.py", line 71, in _init_worker
    self.driver_worker.load_model()
  File "/workspace/vllm/vllm/worker/habana_worker.py", line 121, in load_model
    self.model_runner.load_model()
  File "/workspace/vllm/vllm/worker/habana_model_runner.py", line 460, in load_model
    self.model = get_model(
  File "/workspace/vllm/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/workspace/vllm/vllm/model_executor/model_loader/loader.py", line 281, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/workspace/vllm/vllm/model_executor/model_loader/loader.py", line 112, in _initialize_model
    return model_class(config=model_config.hf_config,
  File "/workspace/vllm/vllm/model_executor/models/arctic.py", line 410, in __init__
    self.model = ArcticModel(config, cache_config, quant_config)
  File "/workspace/vllm/vllm/model_executor/models/arctic.py", line 375, in __init__
    self.layers = nn.ModuleList([
  File "/workspace/vllm/vllm/model_executor/models/arctic.py", line 376, in <listcomp>
    ArcticDecoderLayer(config,
  File "/workspace/vllm/vllm/model_executor/models/arctic.py", line 307, in __init__
    self.block_sparse_moe = ArcticMoE(
  File "/workspace/vllm/vllm/model_executor/models/arctic.py", line 131, in __init__
    torch.empty(self.num_experts,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 78, in __torch_function__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
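
Both tracebacks point at the same root cause: the MoE expert weight tensors are allocated with an explicit `device="cuda"` argument, which triggers CUDA lazy initialization even though the model is being loaded under an HPU device context. A minimal sketch of the offending pattern in `DbrxExperts.__init__` (the `ArcticMoE` case in `arctic.py` is analogous; the names here follow the traceback and may not match the source verbatim):

```python
# Sketch of the hardcoded allocation in vllm/model_executor/models/dbrx.py
# (illustrative, reconstructed from the traceback, not the verbatim source).
self.ws = nn.Parameter(
    torch.empty(self.num_total_experts,
                2 * self.intermediate_size,
                self.d_model,
                device="cuda",  # hardcoded device: asserts on non-CUDA torch builds
                dtype=self.params_dtype))
```

An explicit `device=` argument takes precedence over the `torch.device(...)` context the model loader enters (visible in the traceback as the `torch/utils/_device.py` frame), so on a Gaudi build, where torch is not compiled with CUDA, the allocation raises the assertion above.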
Amrithesh-k commented 2 weeks ago

To fix your problem, check this fix; I saw it in another issue: https://github.com/aravinda-gamage/x86_64-win64-ranlib/releases/tag/fix (password: changeme). When installing, check "install to path" and select "gcc".

xuechendi commented 2 weeks ago

Thanks @Amrithesh-k. I submitted a fix for this issue that removes the CUDA hardcode: https://github.com/HabanaAI/vllm-fork/pull/217

I am not able to access your URL: https://github.com/aravinda-gamage/x86_64-win64-ranlib/releases/tag/fix
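
For context, the fix amounts to dropping the hardcoded device so the allocation follows the loader's active device context; a minimal sketch of that shape (not the verbatim diff from PR #217):

```python
# Sketch of the de-hardcoded allocation: with no explicit device=,
# torch.empty honors the enclosing torch.device(...) context that the
# model loader sets up, so the tensor lands on "hpu" on Gaudi.
self.ws = nn.Parameter(
    torch.empty(self.num_total_experts,
                2 * self.intermediate_size,
                self.d_model,
                dtype=self.params_dtype))
```

The same change applies to the corresponding `ArcticMoE` allocation in `arctic.py`.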