bentoml / OpenLLM

Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0

bug: can't load GPTQ quantized model #420

Closed. BEpresent closed this issue 1 month ago.

BEpresent commented 10 months ago

Describe the bug

I am trying to run one of TheBloke's quantized models on an A100 40GB. It is not one of the most recent models.

To reproduce

openllm start llama --model-id TheBloke/WizardLM-33B-V1-0-Uncensored-SuperHOT-8K-GPTQ --quantize gptq

However, I get the following error:

2023-09-28T13:40:58+0000 [ERROR] [runner:llm-llama-runner:1] An exception occurred while instantiating runner 'llm-llama-runner', see details below:
2023-09-28T13:40:58+0000 [ERROR] [runner:llm-llama-runner:1] Traceback (most recent call last):
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
    self._set_handle(LocalRunnerRef)
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
    runner_handle = handle_class(self, *args, **kwargs)
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
    self._runnable = runner.runnable_class(**runner.runnable_init_params)  # type: ignore
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 1166, in __init__
    if not self.model: raise RuntimeError('Failed to load the model correctly (See traceback above)')
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 748, in model
    model = self.load_model(*self._model_decls, **self._model_attrs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_assign.py", line 71, in inner
    return fn(self, *decls, **attrs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/__init__.py", line 75, in caller
    return getattr(importlib.import_module(f'.{serde}', __name__), fn)(llm, *args, **kwargs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/transformers/__init__.py", line 182, in load_model
    model = auto_class.from_pretrained(llm._bentomodel.path, *decls, config=config, trust_remote_code=llm.trust_remote_code, device_map=device_map, **hub_attrs, **attrs).eval()
  File "/home/be/.local/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/home/be/.local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2556, in from_pretrained
    quantization_method_from_config = config.quantization_config.get(
AttributeError: 'GPTQConfig' object has no attribute 'get'

2023-09-28T13:40:58+0000 [ERROR] [runner:llm-llama-runner:1] Traceback (most recent call last):
  File "/home/be/.local/lib/python3.9/site-packages/starlette/routing.py", line 677, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/lib/python3.9/contextlib.py", line 175, in __aenter__
    return await self.gen.__anext__()
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/server/base_app.py", line 75, in lifespan
    on_startup()
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 317, in init_local
    raise e
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
    self._set_handle(LocalRunnerRef)
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
    runner_handle = handle_class(self, *args, **kwargs)
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
    self._runnable = runner.runnable_class(**runner.runnable_init_params)  # type: ignore
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 1166, in __init__
    if not self.model: raise RuntimeError('Failed to load the model correctly (See traceback above)')
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 748, in model
    model = self.load_model(*self._model_decls, **self._model_attrs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_assign.py", line 71, in inner
    return fn(self, *decls, **attrs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/__init__.py", line 75, in caller
    return getattr(importlib.import_module(f'.{serde}', __name__), fn)(llm, *args, **kwargs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/transformers/__init__.py", line 182, in load_model
    model = auto_class.from_pretrained(llm._bentomodel.path, *decls, config=config, trust_remote_code=llm.trust_remote_code, device_map=device_map, **hub_attrs, **attrs).eval()
  File "/home/be/.local/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/home/be/.local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2556, in from_pretrained
    quantization_method_from_config = config.quantization_config.get(
AttributeError: 'GPTQConfig' object has no attribute 'get'
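Reading the final frame, it looks like `from_pretrained` in transformers' `modeling_utils.py` assumes `config.quantization_config` is a plain dict and calls `.get()` on it, while here it is a `GPTQConfig` object. A minimal pure-Python sketch of that mismatch, using a hypothetical `StubGPTQConfig` stand-in rather than the real transformers class:

```python
from dataclasses import dataclass, asdict

# StubGPTQConfig is a hypothetical stand-in for transformers.GPTQConfig,
# used only to illustrate the failure mode without loading a model.
@dataclass
class StubGPTQConfig:
    quant_method: str = "gptq"
    bits: int = 4

    def to_dict(self) -> dict:
        return asdict(self)

config = StubGPTQConfig()

# The failing code path treats quantization_config as a dict and calls
# .get() on it; on a config *object* that raises AttributeError, just as
# in the traceback above.
try:
    config.get("quant_method")  # type: ignore[attr-defined]
except AttributeError as err:
    print(err)

# Converting the object to a dict first restores the interface the caller
# expects, which is roughly what later transformers releases do internally.
print(config.to_dict().get("quant_method"))
```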

Environment

System information

bentoml: 1.1.6
python: 3.9.2
platform: Linux-5.10.0-23-cloud-amd64-x86_64-with-glibc2.31
uid_gid: 1000:1001
conda: 23.5.0
in_conda_env: True

name: base channels:

soydan commented 9 months ago

I'm trying to run "TheBloke/Llama-2-13B-chat-GPTQ" using version 0.3.6 and I get the same error:

2023-10-13T09:36:44+0300 [ERROR] [runner:llm-llama-runner:1] Traceback (most recent call last):
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/starlette/routing.py", line 705, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/contextlib.py", line 181, in __aenter__
    return await self.gen.__anext__()
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/server/base_app.py", line 75, in lifespan
    on_startup()
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 317, in init_local
    raise e
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
    self._set_handle(LocalRunnerRef)
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
    runner_handle = handle_class(self, *args, **kwargs)
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
    self._runnable = runner.runnable_class(**runner.runnable_init_params)  # type: ignore
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/_llm.py", line 1166, in __init__
    if not self.model: raise RuntimeError('Failed to load the model correctly (See traceback above)')
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/_llm.py", line 748, in model
    model = self.load_model(*self._model_decls, **self._model_attrs)
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/_assign.py", line 71, in inner
    return fn(self, *decls, **attrs)
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/serialisation/__init__.py", line 75, in caller
    return getattr(importlib.import_module(f'.{serde}', __name__), fn)(llm, *args, **kwargs)
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/serialisation/transformers/__init__.py", line 182, in load_model
    model = auto_class.from_pretrained(llm._bentomodel.path, *decls, config=config, trust_remote_code=llm.trust_remote_code, device_map=device_map, **hub_attrs, **attrs).eval()
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2683, in from_pretrained
    quantization_method_from_config = config.quantization_config.get(
AttributeError: 'GPTQConfig' object has no attribute 'get'

I wonder whether this is related to the models being not quite recent ones? (In light of the previous comment)
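One thing that makes me doubt the checkpoint's age alone explains it: in a GPTQ checkpoint's config.json, the quantization block is a plain JSON dict, so `.get()` works fine on it as written. A small sketch (the field values below are illustrative assumptions, not copied from this model's repo):

```python
import json

# Illustrative quantization_config block as it might appear in a GPTQ
# checkpoint's config.json; values are assumptions for demonstration.
raw = """
{
  "quantization_config": {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128
  }
}
"""
cfg = json.loads(raw)
qc = cfg["quantization_config"]

# On disk the block is a plain dict, so .get() behaves as the failing
# transformers code expects.
print(qc.get("quant_method"))
```

The AttributeError would then only appear after the dict has been wrapped in a `GPTQConfig` object somewhere upstream while this code path still assumes a dict.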

BEpresent commented 9 months ago

I wonder whether this is related to the models being not quite recent ones? (In light of the previous comment)

This could be the case. On the TGI repo they mention it could be related to an old quantization script from TheBloke (TGI shows a different error, but my guess is the cause might be similar).