bentoml / OpenLLM

Run any open-source LLM, such as Llama 3.1 or Gemma, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0

bug: can't load GPTQ quantized model #420

Closed. BEpresent closed this issue 1 month ago.

BEpresent commented 10 months ago

Describe the bug

I am trying to run one of TheBloke's quantized models on an A100 40GB. It is not one of the most recent models.

To reproduce

openllm start llama --model-id TheBloke/WizardLM-33B-V1-0-Uncensored-SuperHOT-8K-GPTQ --quantize gptq

However, I get the following error:

2023-09-28T13:40:58+0000 [ERROR] [runner:llm-llama-runner:1] An exception occurred while instantiating runner 'llm-llama-runner', see details below:
2023-09-28T13:40:58+0000 [ERROR] [runner:llm-llama-runner:1] Traceback (most recent call last):
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
    self._set_handle(LocalRunnerRef)
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
    runner_handle = handle_class(self, *args, **kwargs)
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
    self._runnable = runner.runnable_class(**runner.runnable_init_params)  # type: ignore
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 1166, in __init__
    if not self.model: raise RuntimeError('Failed to load the model correctly (See traceback above)')
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 748, in model
    model = self.load_model(*self._model_decls, **self._model_attrs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_assign.py", line 71, in inner
    return fn(self, *decls, **attrs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/__init__.py", line 75, in caller
    return getattr(importlib.import_module(f'.{serde}', __name__), fn)(llm, *args, **kwargs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/transformers/__init__.py", line 182, in load_model
    model = auto_class.from_pretrained(llm._bentomodel.path, *decls, config=config, trust_remote_code=llm.trust_remote_code, device_map=device_map, **hub_attrs, **attrs).eval()
  File "/home/be/.local/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/home/be/.local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2556, in from_pretrained
    quantization_method_from_config = config.quantization_config.get(
AttributeError: 'GPTQConfig' object has no attribute 'get'

2023-09-28T13:40:58+0000 [ERROR] [runner:llm-llama-runner:1] Traceback (most recent call last):
  File "/home/be/.local/lib/python3.9/site-packages/starlette/routing.py", line 677, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/usr/lib/python3.9/contextlib.py", line 175, in __aenter__
    return await self.gen.__anext__()
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/server/base_app.py", line 75, in lifespan
    on_startup()
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 317, in init_local
    raise e
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
    self._set_handle(LocalRunnerRef)
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
    runner_handle = handle_class(self, *args, **kwargs)
  File "/home/be/.local/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
    self._runnable = runner.runnable_class(**runner.runnable_init_params)  # type: ignore
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 1166, in __init__
    if not self.model: raise RuntimeError('Failed to load the model correctly (See traceback above)')
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_llm.py", line 748, in model
    model = self.load_model(*self._model_decls, **self._model_attrs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/_assign.py", line 71, in inner
    return fn(self, *decls, **attrs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/__init__.py", line 75, in caller
    return getattr(importlib.import_module(f'.{serde}', __name__), fn)(llm, *args, **kwargs)
  File "/home/be/.local/lib/python3.9/site-packages/openllm/serialisation/transformers/__init__.py", line 182, in load_model
    model = auto_class.from_pretrained(llm._bentomodel.path, *decls, config=config, trust_remote_code=llm.trust_remote_code, device_map=device_map, **hub_attrs, **attrs).eval()
  File "/home/be/.local/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/home/be/.local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2556, in from_pretrained
    quantization_method_from_config = config.quantization_config.get(
AttributeError: 'GPTQConfig' object has no attribute 'get'
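Reading the final frame, it looks like `from_pretrained` in transformers' `modeling_utils.py` assumes `config.quantization_config` is a plain dict and calls `.get()` on it, while here it is a `GPTQConfig` object. A minimal pure-Python sketch of that mismatch, using a hypothetical `StubGPTQConfig` stand-in rather than the real transformers class:

```python
from dataclasses import dataclass, asdict

# StubGPTQConfig is a hypothetical stand-in for transformers.GPTQConfig,
# used only to illustrate the failure mode without loading a model.
@dataclass
class StubGPTQConfig:
    quant_method: str = "gptq"
    bits: int = 4

    def to_dict(self) -> dict:
        return asdict(self)

config = StubGPTQConfig()

# The failing code path treats quantization_config as a dict and calls
# .get() on it; on a config *object* that raises AttributeError, just as
# in the traceback above.
try:
    config.get("quant_method")  # type: ignore[attr-defined]
except AttributeError as err:
    print(err)

# Converting the object to a dict first restores the interface the caller
# expects, which is roughly what later transformers releases do internally.
print(config.to_dict().get("quant_method"))
```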

Environment

System information

bentoml: 1.1.6
python: 3.9.2
platform: Linux-5.10.0-23-cloud-amd64-x86_64-with-glibc2.31
uid_gid: 1000:1001
conda: 23.5.0
in_conda_env: True

name: base channels:

soydan commented 9 months ago

I'm trying to run "TheBloke/Llama-2-13B-chat-GPTQ" using version 0.3.6 and I get the same error:

2023-10-13T09:36:44+0300 [ERROR] [runner:llm-llama-runner:1] Traceback (most recent call last):
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/starlette/routing.py", line 705, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/contextlib.py", line 181, in __aenter__
    return await self.gen.__anext__()
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/server/base_app.py", line 75, in lifespan
    on_startup()
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 317, in init_local
    raise e
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 307, in init_local
    self._set_handle(LocalRunnerRef)
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner.py", line 150, in _set_handle
    runner_handle = handle_class(self, *args, **kwargs)
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 27, in __init__
    self._runnable = runner.runnable_class(**runner.runnable_init_params)  # type: ignore
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/_llm.py", line 1166, in __init__
    if not self.model: raise RuntimeError('Failed to load the model correctly (See traceback above)')
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/_llm.py", line 748, in model
    model = self.load_model(*self._model_decls, **self._model_attrs)
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/_assign.py", line 71, in inner
    return fn(self, *decls, **attrs)
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/serialisation/__init__.py", line 75, in caller
    return getattr(importlib.import_module(f'.{serde}', __name__), fn)(llm, *args, **kwargs)
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/openllm/serialisation/transformers/__init__.py", line 182, in load_model
    model = auto_class.from_pretrained(llm._bentomodel.path, *decls, config=config, trust_remote_code=llm.trust_remote_code, device_map=device_map, **hub_attrs, **attrs).eval()
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/miniforge/miniforge3/envs/openllm/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2683, in from_pretrained
    quantization_method_from_config = config.quantization_config.get(
AttributeError: 'GPTQConfig' object has no attribute 'get'

I wonder whether this is related to the models being not quite recent ones? (In light of the previous comment)
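One thing that makes me doubt the checkpoint's age alone explains it: in a GPTQ checkpoint's config.json, the quantization block is a plain JSON dict, so `.get()` works fine on it as written. A small sketch (the field values below are illustrative assumptions, not copied from this model's repo):

```python
import json

# Illustrative quantization_config block as it might appear in a GPTQ
# checkpoint's config.json; values are assumptions for demonstration.
raw = """
{
  "quantization_config": {
    "quant_method": "gptq",
    "bits": 4,
    "group_size": 128
  }
}
"""
cfg = json.loads(raw)
qc = cfg["quantization_config"]

# On disk the block is a plain dict, so .get() behaves as the failing
# transformers code expects.
print(qc.get("quant_method"))
```

The AttributeError would then only appear after the dict has been wrapped in a `GPTQConfig` object somewhere upstream while this code path still assumes a dict.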

BEpresent commented 9 months ago

I wonder whether this is related to the models being not quite recent ones? (In light of the previous comment)

This could be the case. On the TGI repo they mention it could be related to an old quantization script from TheBloke (TGI shows a different error, but my guess is the cause might be similar).