bentoml / OpenLLM

Run any open-source LLM, such as Llama or Gemma, as an OpenAI-compatible API endpoint in the cloud.
https://bentoml.com
Apache License 2.0

bug: load llama model from local path but got error 'Cannot find commit hash in LlamaConfig' #598

Closed · qaz-t closed this issue 1 year ago

qaz-t commented 1 year ago

Describe the bug

I'm using conda to create an environment with Python 3.10.12, and I installed the related packages using

pip install "openllm[llama, vllm]"

When I start a Llama service using

openllm start llama --model-id /home/user/models/Llama-2-70B-chat-GPTQ --quantize gptq --workers-per-resource 0.125

it works fine in openllm==0.3.9. However, versions 0.4.0 and 0.4.1 raise an error. I've tried the -HF, -GPTQ, and -AWQ models from TheBloke on Hugging Face and got the same result.

To reproduce

Install the requirements:

pip install "openllm[llama, vllm]"

Start the server:

openllm start llama --model-id /home/user/models/Llama-2-70B-Chat-AWQ --backend vllm --workers-per-resource 0.125 --quantise awq

Logs

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/serialisation/transformers/__init__.py", line 129, in get
    model = bentoml.models.get(llm.tag)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/simple_di/__init__.py", line 139, in _
    return func(*_inject_args(bind.args), **_inject_kwargs(bind.kwargs))
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/models.py", line 45, in get
    return _model_store.get(tag)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/store.py", line 158, in get
    raise NotFound(
bentoml.exceptions.NotFound: Model 'vllm-llama-2-70b-chat-awq:14f4806647c05a0905cd70c55651c9ba7bde8a56' is not found in BentoML store <osfs '/home/user/bentoml/models'>, you may need to run `bentoml models pull` first

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/cli/entrypoint.py", line 381, in import_command
    _ref = openllm.serialisation.get(llm)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/serialisation/__init__.py", line 78, in caller
    return getattr(importlib.import_module(f'.{serde}', __name__), fn)(llm, *args, **kwargs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/serialisation/transformers/__init__.py", line 136, in get
    raise openllm.exceptions.OpenLLMException(f'Failed while getting stored artefact (lookup for traceback):\n{err}') from err
openllm_core.exceptions.OpenLLMException: Failed while getting stored artefact (lookup for traceback):
Model 'vllm-llama-2-70b-chat-awq:14f4806647c05a0905cd70c55651c9ba7bde8a56' is not found in BentoML store <osfs '/home/user/bentoml/models'>, you may need to run `bentoml models pull` first

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/serialisation/transformers/__init__.py", line 129, in get
    model = bentoml.models.get(llm.tag)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/simple_di/__init__.py", line 139, in _
    return func(*_inject_args(bind.args), **_inject_kwargs(bind.kwargs))
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/models.py", line 45, in get
    return _model_store.get(tag)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/bentoml/_internal/store.py", line 158, in get
    raise NotFound(
bentoml.exceptions.NotFound: Model 'vllm-llama-2-70b-chat-awq:14f4806647c05a0905cd70c55651c9ba7bde8a56' is not found in BentoML store <osfs '/home/user/bentoml/models'>, you may need to run `bentoml models pull` first

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/miniconda3/envs/openllm/bin/openllm", line 8, in <module>
    sys.exit(cli())
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/cli/entrypoint.py", line 186, in wrapper
    return_value = func(*args, **attrs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/cli/entrypoint.py", line 168, in wrapper
    return f(*args, **attrs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/cli/_factory.py", line 194, in start_cmd
    llm.save_pretrained()  # ensure_available = True
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/_llm.py", line 233, in save_pretrained
    def save_pretrained(self)->bentoml.Model:return openllm.import_model(self.config['start_name'], model_id=self.model_id, model_version=self._revision, backend=self.__llm_backend__, quantize=self._quantise)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/cli/_sdk.py", line 262, in _import_model
    return import_command.main(args=args, standalone_mode=False)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/cli/entrypoint.py", line 186, in wrapper
    return_value = func(*args, **attrs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/cli/entrypoint.py", line 168, in wrapper
    return f(*args, **attrs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/cli/entrypoint.py", line 387, in import_command
    _ref = openllm.serialisation.get(llm, auto_import=True)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/serialisation/__init__.py", line 78, in caller
    return getattr(importlib.import_module(f'.{serde}', __name__), fn)(llm, *args, **kwargs)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/serialisation/transformers/__init__.py", line 135, in get
    if auto_import: return import_model(llm, trust_remote_code=llm.trust_remote_code)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/simple_di/__init__.py", line 139, in _
    return func(*_inject_args(bind.args), **_inject_kwargs(bind.kwargs))
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/serialisation/transformers/__init__.py", line 64, in import_model
    metadata['_revision'] = get_hash(config)
  File "/home/user/miniconda3/envs/openllm/lib/python3.10/site-packages/openllm/serialisation/transformers/_helpers.py", line 22, in get_hash
    if _commit_hash is None: raise ValueError(f'Cannot find commit hash in {config}')
ValueError: Cannot find commit hash in LlamaConfig {
  "_name_or_path": "/home/user/models/Llama-2-70B-Chat-AWQ",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.35.0",
  "use_cache": true,
  "vocab_size": 32000
}
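
For context, the failure comes from the check in `openllm/serialisation/transformers/_helpers.py` shown at the bottom of the traceback: it raises whenever the loaded config carries no Hub commit hash. Below is a minimal sketch of that condition (not OpenLLM's exact code), assuming transformers' behavior of populating `_commit_hash` only for configs resolved from the Hugging Face Hub, so that configs loaded from a local directory leave it as `None`:

```python
# Minimal sketch of the failing condition, not OpenLLM's exact code.
# transformers records `_commit_hash` only for configs resolved from the
# Hugging Face Hub; loading from a local directory leaves it as None.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("/home/user/models/Llama-2-70B-Chat-AWQ")
commit_hash = getattr(config, "_commit_hash", None)
if commit_hash is None:
    # Mirrors the check in openllm/serialisation/transformers/_helpers.py
    raise ValueError(f"Cannot find commit hash in {config}")
```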

Environment

Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

System information

bentoml: 1.1.9
python: 3.10.12
platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
uid_gid: 53201113:53200513
conda: 23.9.0
in_conda_env: True

conda_packages
```yaml name: openllm channels: - defaults dependencies: - _libgcc_mutex=0.1=main - _openmp_mutex=5.1=1_gnu - bzip2=1.0.8=h7b6447c_0 - ca-certificates=2023.08.22=h06a4308_0 - ld_impl_linux-64=2.38=h1181459_1 - libffi=3.4.4=h6a678d5_0 - libgcc-ng=11.2.0=h1234567_1 - libgomp=11.2.0=h1234567_1 - libstdcxx-ng=11.2.0=h1234567_1 - libuuid=1.41.5=h5eee18b_0 - ncurses=6.4=h6a678d5_0 - openssl=3.0.12=h7f8727e_0 - pip=23.3=py310h06a4308_0 - python=3.10.12=h955ad1f_0 - readline=8.2=h5eee18b_0 - setuptools=68.0.0=py310h06a4308_0 - sqlite=3.41.2=h5eee18b_0 - tk=8.6.12=h1ccaba5_0 - wheel=0.41.2=py310h06a4308_0 - xz=5.4.2=h5eee18b_0 - zlib=1.2.13=h5eee18b_0 - pip: - accelerate==0.24.1 - aiohttp==3.8.6 - aiosignal==1.3.1 - anyio==3.7.1 - appdirs==1.4.4 - asgiref==3.7.2 - async-timeout==4.0.3 - attrs==23.1.0 - bentoml==1.1.9 - bitsandbytes==0.41.2.post1 - build==1.0.3 - cattrs==23.1.2 - certifi==2023.7.22 - charset-normalizer==3.3.2 - circus==0.18.0 - click==8.1.7 - click-option-group==0.5.6 - cloudpickle==3.0.0 - cmake==3.27.7 - coloredlogs==15.0.1 - contextlib2==21.6.0 - cuda-python==12.3.0 - datasets==2.14.6 - deepmerge==1.1.0 - deprecated==1.2.14 - dill==0.3.7 - exceptiongroup==1.1.3 - fairscale==0.4.13 - fastapi==0.104.1 - fastcore==1.5.29 - filelock==3.13.1 - filetype==1.2.0 - frozenlist==1.4.0 - fs==2.4.16 - fsspec==2023.10.0 - ghapi==1.0.4 - h11==0.14.0 - httpcore==1.0.1 - httptools==0.6.1 - httpx==0.25.1 - huggingface-hub==0.17.3 - humanfriendly==10.0 - idna==3.4 - importlib-metadata==6.8.0 - inflection==0.5.1 - jinja2==3.1.2 - jsonschema==4.19.2 - jsonschema-specifications==2023.7.1 - lit==17.0.4 - markdown-it-py==3.0.0 - markupsafe==2.1.3 - mdurl==0.1.2 - mpmath==1.3.0 - msgpack==1.0.7 - multidict==6.0.4 - multiprocess==0.70.15 - mypy-extensions==1.0.0 - networkx==3.2.1 - ninja==1.11.1.1 - numpy==1.26.1 - nvidia-cublas-cu11==11.10.3.66 - nvidia-cuda-cupti-cu11==11.7.101 - nvidia-cuda-nvrtc-cu11==11.7.99 - nvidia-cuda-runtime-cu11==11.7.99 - nvidia-cudnn-cu11==8.5.0.96 - nvidia-cufft-cu11==10.9.0.58 - nvidia-curand-cu11==10.2.10.91 - nvidia-cusolver-cu11==11.4.0.1 - nvidia-cusparse-cu11==11.7.4.91 - nvidia-ml-py==11.525.150 - nvidia-nccl-cu11==2.14.3 - nvidia-nvtx-cu11==11.7.91 - openllm==0.4.1 - openllm-client==0.4.1 - openllm-core==0.4.1 - opentelemetry-api==1.20.0 - opentelemetry-instrumentation==0.41b0 - opentelemetry-instrumentation-aiohttp-client==0.41b0 - opentelemetry-instrumentation-asgi==0.41b0 - opentelemetry-sdk==1.20.0 - opentelemetry-semantic-conventions==0.41b0 - opentelemetry-util-http==0.41b0 - optimum==1.14.0 - orjson==3.9.10 - packaging==23.2 - pandas==2.1.2 - pathspec==0.11.2 - pillow==10.1.0 - pip-requirements-parser==32.0.1 - pip-tools==7.3.0 - prometheus-client==0.18.0 - protobuf==4.25.0 - psutil==5.9.6 - pyarrow==14.0.1 - pydantic==1.10.13 - pygments==2.16.1 - pyparsing==3.1.1 - pyproject-hooks==1.0.0 - python-dateutil==2.8.2 - python-dotenv==1.0.0 - python-json-logger==2.0.7 - python-multipart==0.0.6 - pytz==2023.3.post1 - pyyaml==6.0.1 - pyzmq==25.1.1 - ray==2.8.0 - referencing==0.30.2 - regex==2023.10.3 - requests==2.31.0 - rich==13.6.0 - rpds-py==0.12.0 - safetensors==0.4.0 - schema==0.7.5 - scipy==1.11.3 - sentencepiece==0.1.99 - simple-di==0.1.5 - six==1.16.0 - sniffio==1.3.0 - starlette==0.27.0 - sympy==1.12 - tabulate==0.9.0 - tokenizers==0.14.1 - tomli==2.0.1 - torch==2.0.1 - tornado==6.3.3 - tqdm==4.66.1 - transformers==4.35.0 - triton==2.0.0 - typing-extensions==4.8.0 - tzdata==2023.3 - urllib3==2.0.7 - uvicorn==0.24.0.post1 - uvloop==0.19.0 - vllm==0.2.1.post1 
- watchfiles==0.21.0 - wcwidth==0.2.9 - websockets==12.0 - wrapt==1.15.0 - xformers==0.0.22 - xxhash==3.4.1 - yarl==1.9.2 - zipp==3.17.0 prefix: /home/user/miniconda3/envs/openllm ```
pip_packages
``` accelerate==0.24.1 aiohttp==3.8.6 aiosignal==1.3.1 anyio==3.7.1 appdirs==1.4.4 asgiref==3.7.2 async-timeout==4.0.3 attrs==23.1.0 bentoml==1.1.9 bitsandbytes==0.41.2.post1 build==1.0.3 cattrs==23.1.2 certifi==2023.7.22 charset-normalizer==3.3.2 circus==0.18.0 click==8.1.7 click-option-group==0.5.6 cloudpickle==3.0.0 cmake==3.27.7 coloredlogs==15.0.1 contextlib2==21.6.0 cuda-python==12.3.0 datasets==2.14.6 deepmerge==1.1.0 Deprecated==1.2.14 dill==0.3.7 exceptiongroup==1.1.3 fairscale==0.4.13 fastapi==0.104.1 fastcore==1.5.29 filelock==3.13.1 filetype==1.2.0 frozenlist==1.4.0 fs==2.4.16 fsspec==2023.10.0 ghapi==1.0.4 h11==0.14.0 httpcore==1.0.1 httptools==0.6.1 httpx==0.25.1 huggingface-hub==0.17.3 humanfriendly==10.0 idna==3.4 importlib-metadata==6.8.0 inflection==0.5.1 Jinja2==3.1.2 jsonschema==4.19.2 jsonschema-specifications==2023.7.1 lit==17.0.4 markdown-it-py==3.0.0 MarkupSafe==2.1.3 mdurl==0.1.2 mpmath==1.3.0 msgpack==1.0.7 multidict==6.0.4 multiprocess==0.70.15 mypy-extensions==1.0.0 networkx==3.2.1 ninja==1.11.1.1 numpy==1.26.1 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.2.10.91 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusparse-cu11==11.7.4.91 nvidia-ml-py==11.525.150 nvidia-nccl-cu11==2.14.3 nvidia-nvtx-cu11==11.7.91 openllm==0.4.1 openllm-client==0.4.1 openllm-core==0.4.1 opentelemetry-api==1.20.0 opentelemetry-instrumentation==0.41b0 opentelemetry-instrumentation-aiohttp-client==0.41b0 opentelemetry-instrumentation-asgi==0.41b0 opentelemetry-sdk==1.20.0 opentelemetry-semantic-conventions==0.41b0 opentelemetry-util-http==0.41b0 optimum==1.14.0 orjson==3.9.10 packaging==23.2 pandas==2.1.2 pathspec==0.11.2 Pillow==10.1.0 pip-requirements-parser==32.0.1 pip-tools==7.3.0 prometheus-client==0.18.0 protobuf==4.25.0 psutil==5.9.6 pyarrow==14.0.1 pydantic==1.10.13 Pygments==2.16.1 pyparsing==3.1.1 pyproject_hooks==1.0.0 python-dateutil==2.8.2 python-dotenv==1.0.0 python-json-logger==2.0.7 python-multipart==0.0.6 pytz==2023.3.post1 PyYAML==6.0.1 pyzmq==25.1.1 ray==2.8.0 referencing==0.30.2 regex==2023.10.3 requests==2.31.0 rich==13.6.0 rpds-py==0.12.0 safetensors==0.4.0 schema==0.7.5 scipy==1.11.3 sentencepiece==0.1.99 simple-di==0.1.5 six==1.16.0 sniffio==1.3.0 starlette==0.27.0 sympy==1.12 tabulate==0.9.0 tokenizers==0.14.1 tomli==2.0.1 torch==2.0.1 tornado==6.3.3 tqdm==4.66.1 transformers==4.35.0 triton==2.0.0 typing_extensions==4.8.0 tzdata==2023.3 urllib3==2.0.7 uvicorn==0.24.0.post1 uvloop==0.19.0 vllm==0.2.1.post1 watchfiles==0.21.0 wcwidth==0.2.9 websockets==12.0 wrapt==1.15.0 xformers==0.0.22 xxhash==3.4.1 yarl==1.9.2 zipp==3.17.0 ```

System information (Optional)

No response

aarnphm commented 1 year ago

You will need to provide a model_version for local models for now.

I will take a look into supporting local models more concretely. We have a separate issue tracking local paths.
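
For example (a sketch of the suggested workaround, assuming the `--model-version` flag in your installed 0.4.x release; the value `local-awq` is just an arbitrary label):

openllm start llama --model-id /home/user/models/Llama-2-70B-Chat-AWQ --backend vllm --quantise awq --workers-per-resource 0.125 --model-version local-awq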

aarnphm commented 1 year ago

~Oh, actually this does seem like a bug. Will fix it.~

~Edit: this is not a bug, sorry~

Edit 2: This is a bug. I will release a fix shortly.
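
A fix would presumably need a fallback when the config has no Hub commit hash. A hypothetical sketch of that shape (not the actual patch; `get_hash` here is modeled on the helper in the traceback, and the path-hash fallback is an assumption):

```python
# Hypothetical fallback for local checkpoints (sketch, not the actual patch).
import hashlib
import os

def get_hash(config, model_id: str) -> str:
    # Hub-resolved configs carry a commit hash; use it when present.
    commit_hash = getattr(config, "_commit_hash", None)
    if commit_hash is not None:
        return commit_hash
    # Local directories have no Hub commit hash, so fall back to a
    # deterministic identifier derived from the resolved path.
    if os.path.isdir(model_id):
        return hashlib.sha1(os.path.abspath(model_id).encode("utf-8")).hexdigest()
    raise ValueError(f"Cannot find commit hash in {config}")
```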