InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] lmdeploy cannot run inference normally on a model quantized with smooth_quant #1822

Closed CodexDive closed 3 months ago

CodexDive commented 3 months ago

Checklist

Describe the bug

(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ lmdeploy chat --backend pytorch lm-deploy-042-smooth-quant-llama2-7b-hf
2024-06-21 05:54:12,555 - lmdeploy - INFO - Checking environment for PyTorch Engine.
2024-06-21 05:54:13,450 - lmdeploy - INFO - Checking model.
2024-06-21 05:54:13,450 - lmdeploy - WARNING - LMDeploy requires transformers version: [4.33.0 ~ 4.38.2], but found version: 4.40.2
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:13<00:00, 18.29s/it]
2024-06-21 05:55:27,119 - lmdeploy - INFO - Patching model.
2024-06-21 05:55:27,189 - lmdeploy - INFO - build CacheEngine with config:CacheConfig(block_size=64, num_cpu_blocks=128, num_gpu_blocks=802, window_size=-1, cache_max_entry_count=0.8, max_prefill_token_num=4096, enable_prefix_caching=False)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
match template: <llama2>

double enter to end input >>> 
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

 [/INST]  
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

 [/INST] 

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

 [/INST] 

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

 [/INST] 

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coher

double enter to end input >>> <s>[INST]  [/INST]  huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

 [/INST] 

 [/INST] 

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something that is not correct. If you do not know the answer to a question, please do not share false information.
<</SYS>>

 [/INST] 

 [/INST] 

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something that is not correct. If you do not know the answer to a question, please do not share false information.
<</SYS>>

 [/INST] 

 [/INST] 

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something that is not correct. If you do not know the answer to a question, please do not share false information.
<</SYS>>

 [/INST]

Reproduction

I quantized a model with smooth_quant using lmdeploy 0.4.2, but the resulting quantized model cannot be used for inference.

Environment

(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ python -c "import torch; print('device count:',torch.cuda.device_count())"
device count: 8
(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ lmdeploy check_env
sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda-12.0
NVCC: Cuda compilation tools, release 12.0, V12.0.140
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.2.1+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.7  (built against CUDA 12.2)
    - Built with CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.1+cu118
LMDeploy: 0.4.2+9a00760
transformers: 4.40.2
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.2.0

Error traceback

No response

CodexDive commented 3 months ago

(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/zhangweixing/model$ lmdeploy chat --backend pytorch llama2-7b-hf/
2024-06-21 07:11:28,011 - lmdeploy - INFO - Checking environment for PyTorch Engine.
2024-06-21 07:11:40,428 - lmdeploy - INFO - Checking model.
2024-06-21 07:11:40,428 - lmdeploy - WARNING - LMDeploy requires transformers version: [4.33.0 ~ 4.38.2], but found version: 4.40.2
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:46<00:00, 23.36s/it]
2024-06-21 07:12:28,526 - lmdeploy - INFO - Patching model.
2024-06-21 07:12:28,614 - lmdeploy - INFO - build CacheEngine with config:CacheConfig(block_size=64, num_cpu_blocks=128, num_gpu_blocks=650, window_size=-1, cache_max_entry_count=0.8, max_prefill_token_num=4096, enable_prefix_caching=False)
match template:

double enter to end input >>> 请给我一首春天的诗歌

[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>

请给我一首春天的诗歌 [/INST] huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:

[INST] <<SYS>>

你是一个帮助,尊重和诚实的助手。总是回答最有助的方式,同时保持安全。你的答案不应包含任何不道德,暴力,种族主义,性别主义,毒质,危险,或非法内容。请确保你的回答是社会中立和正面的。

如果一个问题不是有意义,或者不是事实上有效的,请不要回答不正确的答案。如果你不知道答案,请不要分享不真实的信息。 <</SYS>>

请给我一首春天的诗歌 [/INST]

[INST] <<SYS>>

你是一个帮助,尊重和诚实的助手。总是回答最有助的方式,同时保持安全。你的答案不应包含任何不道德,暴力,种族主义,性别主义,毒质,危险,或非法内容。请确保你的回答是社会中立和正面的。

如果一个问题不是有意义,或者不是事实上有效的,请不要回答不正确的答案。如果你不知道答案,请不要分享不真实的信息。 <</SYS>>

请给我一首春天的诗歌 [/INST]

[INST] <<SYS>>

你是一个帮助,尊重和诚实的助手。总是回答最有助的方式,同时保持安全。你的答

double enter to end input >>>

I see the same problem with the unquantized version as well.

CodexDive commented 3 months ago

Previously, I remember the prompt markers were supposed to be something like <|User|> and <|Bot|>. What is going on with this run?

CodexDive commented 3 months ago

<</SYS>>

请给我一首春天的诗歌 [/INST]

[INST] <<SYS>>

你是一个帮助,尊重和诚实的助手。总是回答最有助的方式,同时保持安全。你的答

double enter to end input >>> 锄禾日当午

[INST] 锄禾日当午 [/INST]

[INST] <<SYS>>

你是一个帮助,尊重和诚实的助手。总是回答最有助的方式,同时保持安全。你的答案不应包含任何不道德,暴力,种族主义,性别主义,毒质,危险,或非法内容。请确保你的回答是社会中立和正面的。

如果一个问题不是有意义,或者不是事实上有效的,请不要回答不正确的答案。如果你不知道答案,请不要分享不真实的信息。 <</SYS>>

请给我一首春天的诗歌 [/INST]

[INST] <<SYS>>

你是一个帮助,尊

CodexDive commented 3 months ago

Both the unquantized Llama2-7B and the smooth_quant-quantized version show this behavior; inference is completely unusable.

lvhan028 commented 3 months ago

@AllentDan may help investigate it

CodexDive commented 3 months ago

I am not sure whether something in my configuration is wrong.

grimoire commented 3 months ago

You are probably not using a chat model.

AllentDan commented 3 months ago

You are probably not using a chat model.

Right, this is a usage issue. Base models do not support chat. @CodexDive

CodexDive commented 3 months ago

It is indeed not a chat model. Then how should I use the quantized version of a base model?

AllentDan commented 3 months ago

--model-name base. We will change this later so that it no longer needs to be set manually.

CodexDive commented 3 months ago
import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

Can inference code like this use the quantized model directly? Does it also need the --model-name base argument?

AllentDan commented 3 months ago

The pipeline API has a model_name parameter.
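
For reference, a minimal sketch of the pipeline API with the model_name argument (the model path is a placeholder; for a smooth_quant w8a8 model the PyTorch backend also needs to be selected, as discussed further down this thread):

import lmdeploy

# "/path/to/base_model" is a placeholder path; model_name='base' tells the
# pipeline to treat it as a base (non-chat) model instead of applying a chat template
pipe = lmdeploy.pipeline("/path/to/base_model", model_name='base')
response = pipe(["Hi, pls intro yourself", "Shanghai is"], do_preprocess=False)
print(response)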

CodexDive commented 3 months ago
PyTorch engine arguments:
  --adapters [ADAPTERS [ADAPTERS ...]]
                        Used to set path(s) of lora adapter(s). One can input key-value pairs in
                        xxx=yyy format for multiple lora adapters. If only have one adapter, one
                        can only input the path of the adapter.. Default: None. Type: str
  --tp TP               GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type:
                        int
  --model-name MODEL_NAME
                        The name of the to-be-deployed model, such as llama-7b, llama-13b,
                        vicuna-7b and etc. You can run `lmdeploy list` to get the supported model
                        names. Default: None. Type: str

The explanation of this argument is indeed not clear.

CodexDive commented 3 months ago

For --model-name, should the value passed be 'base' or 'llama2'?

(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ lmdeploy list
The older chat template name like "internlm2-7b", "qwen-7b" and so on are deprecated and will be removed in the future. The supported chat template names are:
baichuan2
chatglm
codellama
dbrx
deepseek
deepseek-coder
deepseek-vl
falcon
gemma
internlm
internlm-xcomposer2
internlm-xcomposer2-4khd
internlm2
internvl-internlm2
internvl-zh
internvl-zh-hermes2
llama
llama2
llama3
llava-chatml
llava-v1
mini-gemini-vicuna
mistral
mixtral
phi-3
puyu
qwen
solar
ultracm
ultralm
vicuna
wizardlm
yi
yi-vl
AllentDan commented 3 months ago

LMDeploy currently prioritizes the handling of chat models. For non-chat models, in principle users only need to use the api_server's /v1/completions endpoint.
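
For illustration, a minimal sketch of calling that endpoint over plain HTTP after starting the server with `lmdeploy serve api_server <model_path>`; the address and model name below are assumed defaults/placeholders and may differ in your deployment:

import requests

# assumed default api_server address; adjust host/port to your deployment
url = "http://0.0.0.0:23333/v1/completions"
payload = {
    "model": "llama2-7b-hf",   # placeholder: the model name served by api_server
    "prompt": "Shanghai is",
    "max_tokens": 64,
}
resp = requests.post(url, json=payload)
print(resp.json())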

CodexDive commented 3 months ago

So if I use lmdeploy to run a base model, can I only use the api_server mode? Can I run the quantized model with inference code?

CodexDive commented 3 months ago

Because the llama2 chat models cannot be obtained, quantized base models such as llama2-7b and llama2-13b need to be supported.

AllentDan commented 3 months ago
import lmdeploy
pipe = lmdeploy.pipeline("you_awosome_w8a8_model_path")
response = pipe(["Hi, pls intro yourself", "Shanghai is"], do_preprocess=False)
print(response)
CodexDive commented 3 months ago

Thanks, that helps a lot. It is a good thing you explained the difference between how lmdeploy drives base models and chat models; otherwise I would have had no clue.

CodexDive commented 3 months ago

I quantized the llama2-7B-hf model with lmdeploy 0.4.2; the smooth_quant step itself reported no errors:

(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ lmdeploy lite smooth_quant /mnt/self-define/zhangweixing/model/llama2-7b-hf/ --work-dir lm-deploy-042-smooth-quant-llama2-7b-hf
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.35it/s]
Move model.embed_tokens to GPU.
Move model.layers.0 to CPU. Move model.layers.1 to CPU. Move model.layers.2 to CPU. Move model.layers.3 to CPU. Move model.layers.4 to CPU. Move model.layers.5 to CPU. Move model.layers.6 to CPU. Move model.layers.7 to CPU. Move model.layers.8 to CPU. Move model.layers.9 to CPU. Move model.layers.10 to CPU. Move model.layers.11 to CPU. Move model.layers.12 to CPU. Move model.layers.13 to CPU. Move model.layers.14 to CPU. Move model.layers.15 to CPU. Move model.layers.16 to CPU. Move model.layers.17 to CPU. Move model.layers.18 to CPU. Move model.layers.19 to CPU. Move model.layers.20 to CPU. Move model.layers.21 to CPU. Move model.layers.22 to CPU. Move model.layers.23 to CPU. Move model.layers.24 to CPU. Move model.layers.25 to CPU. Move model.layers.26 to CPU. Move model.layers.27 to CPU. Move model.layers.28 to CPU. Move model.layers.29 to CPU. Move model.layers.30 to CPU. Move model.layers.31 to CPU.
Move model.norm to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/datasets/load.py:1429: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/datasets/load.py:1429: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
model.layers.0, samples: 128, max gpu memory: 6.63 GB model.layers.1, samples: 128, max gpu memory: 8.63 GB model.layers.2, samples: 128, max gpu memory: 8.63 GB model.layers.3, samples: 128, max gpu memory: 8.63 GB model.layers.4, samples: 128, max gpu memory: 8.63 GB model.layers.5, samples: 128, max gpu memory: 8.63 GB model.layers.6, samples: 128, max gpu memory: 8.63 GB model.layers.7, samples: 128, max gpu memory: 8.63 GB model.layers.8, samples: 128, max gpu memory: 8.63 GB model.layers.9, samples: 128, max gpu memory: 8.63 GB model.layers.10, samples: 128, max gpu memory: 8.63 GB model.layers.11, samples: 128, max gpu memory: 8.63 GB model.layers.12, samples: 128, max gpu memory: 8.63 GB model.layers.13, samples: 128, max gpu memory: 8.63 GB model.layers.14, samples: 128, max gpu memory: 8.63 GB model.layers.15, samples: 128, max gpu memory: 8.63 GB model.layers.16, samples: 128, max gpu memory: 8.63 GB model.layers.17, samples: 128, max gpu memory: 8.63 GB model.layers.18, samples: 128, max gpu memory: 8.63 GB model.layers.19, samples: 128, max gpu memory: 8.63 GB model.layers.20, samples: 128, max gpu memory: 8.63 GB model.layers.21, samples: 128, max gpu memory: 8.63 GB model.layers.22, samples: 128, max gpu memory: 8.63 GB model.layers.23, samples: 128, max gpu memory: 8.63 GB model.layers.24, samples: 128, max gpu memory: 8.63 GB model.layers.25, samples: 128, max gpu memory: 8.63 GB model.layers.26, samples: 128, max gpu memory: 8.63 GB model.layers.27, samples: 128, max gpu memory: 8.63 GB model.layers.28, samples: 128, max gpu memory: 8.63 GB model.layers.29, samples: 128, max gpu memory: 8.63 GB model.layers.30, samples: 128, max gpu memory: 8.63 GB model.layers.31, samples: 128, max gpu memory: 8.63 GB
model.layers.0 smooth weight done. model.layers.1 smooth weight done. model.layers.2 smooth weight done. model.layers.3 smooth weight done. model.layers.4 smooth weight done. model.layers.5 smooth weight done. model.layers.6 smooth weight done. model.layers.7 smooth weight done. model.layers.8 smooth weight done. model.layers.9 smooth weight done. model.layers.10 smooth weight done. model.layers.11 smooth weight done. model.layers.12 smooth weight done. model.layers.13 smooth weight done. model.layers.14 smooth weight done. model.layers.15 smooth weight done. model.layers.16 smooth weight done. model.layers.17 smooth weight done. model.layers.18 smooth weight done. model.layers.19 smooth weight done. model.layers.20 smooth weight done. model.layers.21 smooth weight done. model.layers.22 smooth weight done. model.layers.23 smooth weight done. model.layers.24 smooth weight done. model.layers.25 smooth weight done. model.layers.26 smooth weight done. model.layers.27 smooth weight done. model.layers.28 smooth weight done. model.layers.29 smooth weight done. model.layers.30 smooth weight done. model.layers.31 smooth weight done.

However, when I run the quantized model with inference code, it fails to load, as shown below:

(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy/vllm_test$ python test_inference_smooth_quant_llama.py 
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Traceback (most recent call last):
  File "test_inference_smooth_quant_llama.py", line 3, in <module>
    pipe = lmdeploy.pipeline(model_path="/mnt/self-define/sunning/lmdeploy/lm-deploy-042-smooth-quant-llama2-7b-hf", model_name='base')
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/api.py", line 94, in pipeline
    return pipeline_class(model_path,
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/serve/async_engine.py", line 206, in __init__
    self._build_turbomind(model_path=model_path,
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/serve/async_engine.py", line 253, in _build_turbomind
    self.engine = tm.TurboMind.from_pretrained(
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 387, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 161, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 270, in _from_hf
    output_model = OUTPUT_MODELS.get(output_format)(
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/w4.py", line 80, in __init__
    super().__init__(input_model, cfg, to_file, out_dir)
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 156, in __init__
    self.cfg = self.get_config(cfg)
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/w4.py", line 92, in get_config
    w1s, _, _ = bin.ffn_scale(i)
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/source_model/llama_awq.py", line 52, in ffn_scale
    return ensure_fp16orint32(self._ffn(i, 'scales'))
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/source_model/llama.py", line 99, in _ffn
    tensor = self.params[
KeyError: 'model.layers.0.mlp.gate_proj.scales'

The content of test_inference_smooth_quant_llama.py is as follows:


import lmdeploy
pipe = lmdeploy.pipeline(model_path="/mnt/self-define/sunning/lmdeploy/lm-deploy-042-smooth-quant-llama2-7b-hf", model_name='base')
response = pipe(["中国的首都是", "Shanghai is"], do_preprocess=False)
print(response)
AllentDan commented 3 months ago

Try the code below. Specifying PytorchEngineConfig selects the PyTorch engine, which is the backend that supports smooth_quant (w8a8) models, instead of the default TurboMind backend that raises the KeyError above.

import lmdeploy
from lmdeploy.messages import PytorchEngineConfig
pipe = lmdeploy.pipeline(model_path="/mnt/self-define/sunning/lmdeploy/lm-deploy-042-smooth-quant-llama2-7b-hf", backend_config=PytorchEngineConfig(), model_name='base')
response = pipe(["中国的首都是", "Shanghai is"], do_preprocess=False)
print(response)
CodexDive commented 3 months ago

1830

Please take a look at why the Qwen-7B-Chat model cannot be quantized with smooth_quant.

AllentDan commented 3 months ago

It is not supported yet.

AllentDan commented 3 months ago

If there is nothing further on this issue, I will close it. Feel free to reopen it if you have more questions.

yanchenmochen commented 2 months ago

If there is nothing further on this issue, I will close it. Feel free to reopen it if you have more questions.

I have a related development task going on recently: I am using lmdeploy to export OPT models, and I skip the related fusion during export, but the accuracy of the quantized OPT model drops a lot. Could I communicate with you directly?

AllentDan commented 2 months ago

@yanchenmochen How about asking in the WeChat group? You could also try --search-scale.

yanchenmochen commented 2 months ago

@AllentDan I don't have the WeChat group. Is there a WeChat contact?

AllentDan commented 2 months ago

https://cdn.vansin.top/internlm/lmdeploy.jpg You can ask in the group; the developers are all in there.

yanchenmochen commented 2 months ago

@AllentDan Thanks, I have joined the group and posted my question.