Closed. CodexDive closed this issue 4 months ago.
(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/zhangweixing/model$ lmdeploy chat --backend pytorch llama2-7b-hf/
2024-06-21 07:11:28,011 - lmdeploy - INFO - Checking environment for PyTorch Engine.
2024-06-21 07:11:40,428 - lmdeploy - INFO - Checking model.
2024-06-21 07:11:40,428 - lmdeploy - WARNING - LMDeploy requires transformers version: [4.33.0 ~ 4.38.2], but found version: 4.40.2
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:46<00:00, 23.36s/it]
2024-06-21 07:12:28,526 - lmdeploy - INFO - Patching model.
2024-06-21 07:12:28,614 - lmdeploy - INFO - build CacheEngine with config:CacheConfig(block_size=64, num_cpu_blocks=128, num_gpu_blocks=650, window_size=-1, cache_max_entry_count=0.8, max_prefill_token_num=4096, enable_prefix_caching=False)
match template:
double enter to end input >>> 请给我一首春天的诗歌
[INST] <<SYS>>
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. <</SYS>>
请给我一首春天的诗歌 [/INST] huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[INST] <<SYS>>
你是一个帮助,尊重和诚实的助手。总是回答最有助的方式,同时保持安全。你的答案不应包含任何不道德,暴力,种族主义,性别主义,毒质,危险,或非法内容。请确保你的回答是社会中立和正面的。
如果一个问题不是有意义,或者不是事实上有效的,请不要回答不正确的答案。如果你不知道答案,请不要分享不真实的信息。 <</SYS>>
请给我一首春天的诗歌 [/INST]
[INST] <<SYS>>
你是一个帮助,尊重和诚实的助手。总是回答最有助的方式,同时保持安全。你的答案不应包含任何不道德,暴力,种族主义,性别主义,毒质,危险,或非法内容。请确保你的回答是社会中立和正面的。
如果一个问题不是有意义,或者不是事实上有效的,请不要回答不正确的答案。如果你不知道答案,请不要分享不真实的信息。 <</SYS>>
请给我一首春天的诗歌 [/INST]
[INST] <<SYS>>
你是一个帮助,尊重和诚实的助手。总是回答最有助的方式,同时保持安全。你的答
double enter to end input >>>
I hit this problem with the unquantized version as well. As I recall, shouldn't the prompt markers be something like <|User|> <|Bot|>? What is going on in this run?
<</SYS>>
请给我一首春天的诗歌 [/INST]
[INST] <<SYS>>
你是一个帮助,尊重和诚实的助手。总是回答最有助的方式,同时保持安全。你的答
double enter to end input >>> 锄禾日当午
[INST] 锄禾日当午 [/INST]
[INST] <<SYS>>
你是一个帮助,尊重和诚实的助手。总是回答最有助的方式,同时保持安全。你的答案不应包含任何不道德,暴力,种族主义,性别主义,毒质,危险,或非法内容。请确保你的回答是社会中立和正面的。
如果一个问题不是有意义,或者不是事实上有效的,请不要回答不正确的答案。如果你不知道答案,请不要分享不真实的信息。 <</SYS>>
请给我一首春天的诗歌 [/INST]
[INST] <<SYS>>
你是一个帮助,尊
Both the unquantized Llama2-7B and the smoothquant-quantized version show this behavior; inference is completely unusable.
@AllentDan may help investigate it
I am not sure whether some of my configuration is incorrect.
You are probably not using a chat model.
Yes, it is a usage issue. Base models do not support chat. @CodexDive
It is indeed not a chat model. Then how should I use the quantized version of a base model?
Add --model-name base. We will change this later so that it no longer needs to be added manually.
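For example, a sketch based on the command at the top of this issue (the model path is simply the one from that log; adjust it to your own):
lmdeploy chat --backend pytorch /mnt/self-define/zhangweixing/model/llama2-7b-hf/ --model-name base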
import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
Can inference code like this use the quantized model directly? Does it also need this --model-name base argument?
pipeline has a model_name parameter.
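A minimal sketch of passing it (the checkpoint path is a placeholder; as suggested above, model_name='base' makes LMDeploy treat the model as a plain base model instead of matching a chat template):
import lmdeploy
# placeholder path to a base (non-chat) llama2 checkpoint
pipe = lmdeploy.pipeline("/path/to/llama2-7b-hf", model_name="base")
response = pipe(["Shanghai is"])
print(response)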
PyTorch engine arguments:
--adapters [ADAPTERS [ADAPTERS ...]]
Used to set path(s) of lora adapter(s). One can input key-value pairs in
xxx=yyy format for multiple lora adapters. If only have one adapter, one
can only input the path of the adapter.. Default: None. Type: str
--tp TP GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type:
int
--model-name MODEL_NAME
The name of the to-be-deployed model, such as llama-7b, llama-13b,
vicuna-7b and etc. You can run `lmdeploy list` to get the supported model
names. Default: None. Type: str
This parameter description is indeed not clear.
For --model-name, should the value passed be 'base' or 'llama2'?
(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ lmdeploy list
The older chat template name like "internlm2-7b", "qwen-7b" and so on are deprecated and will be removed in the future. The supported chat template names are:
baichuan2
chatglm
codellama
dbrx
deepseek
deepseek-coder
deepseek-vl
falcon
gemma
internlm
internlm-xcomposer2
internlm-xcomposer2-4khd
internlm2
internvl-internlm2
internvl-zh
internvl-zh-hermes2
llama
llama2
llama3
llava-chatml
llava-v1
mini-gemini-vicuna
mistral
mixtral
phi-3
puyu
qwen
solar
ultracm
ultralm
vicuna
wizardlm
yi
yi-vl
LMDeploy currently prioritizes the chat-model workflow. For non-chat (base) models, in theory users only need the api_server's /v1/completions interface.
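A minimal sketch of calling that interface (assumptions: the server was started with something like `lmdeploy serve api_server /path/to/llama2-7b-hf` on the default port 23333; the model name and prompt below are placeholders):
import requests

resp = requests.post(
    "http://0.0.0.0:23333/v1/completions",
    json={
        "model": "llama2",        # placeholder: must match the name the server reports for the deployed model
        "prompt": "Shanghai is",
        "max_tokens": 64,
    },
)
print(resp.json())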
So if I use lmdeploy to run the base version, can I only use api_server mode? Can the quantized model be run with inference code?
Since I cannot obtain a llama2 chat model, I need support for quantized models like llama2-7b and llama2-13b.
import lmdeploy
pipe = lmdeploy.pipeline("you_awosome_w8a8_model_path")
response = pipe(["Hi, pls intro yourself", "Shanghai is"], do_preprocess=False)
print(response)
Thanks, that helps a lot. It's a good thing you explained how lmdeploy handles base models versus chat models differently, otherwise I would have had no clue.
I used lmdeploy 0.4.2 to quantize the llama2-7B-hf model; the smooth_quant quantization reported no errors:
(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ lmdeploy lite smooth_quant /mnt/self-define/zhangweixing/model/llama2-7b-hf/ --work-dir lm-deploy-042-smooth-quant-llama2-7b-hf
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.35it/s]
Move model.embed_tokens to GPU. Move model.layers.0 to CPU. Move model.layers.1 to CPU. Move model.layers.2 to CPU. Move model.layers.3 to CPU. Move model.layers.4 to CPU. Move model.layers.5 to CPU. Move model.layers.6 to CPU. Move model.layers.7 to CPU. Move model.layers.8 to CPU. Move model.layers.9 to CPU. Move model.layers.10 to CPU. Move model.layers.11 to CPU. Move model.layers.12 to CPU. Move model.layers.13 to CPU. Move model.layers.14 to CPU. Move model.layers.15 to CPU. Move model.layers.16 to CPU. Move model.layers.17 to CPU. Move model.layers.18 to CPU. Move model.layers.19 to CPU. Move model.layers.20 to CPU. Move model.layers.21 to CPU. Move model.layers.22 to CPU. Move model.layers.23 to CPU. Move model.layers.24 to CPU. Move model.layers.25 to CPU. Move model.layers.26 to CPU. Move model.layers.27 to CPU. Move model.layers.28 to CPU. Move model.layers.29 to CPU. Move model.layers.30 to CPU. Move model.layers.31 to CPU. Move model.norm to GPU. Move lm_head to CPU.
Loading calibrate dataset ...
/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/datasets/load.py:1429: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/datasets/load.py:1429: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only You can avoid this message in future by passing the argument `trust_remote_code=True`. Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
model.layers.0, samples: 128, max gpu memory: 6.63 GB model.layers.1, samples: 128, max gpu memory: 8.63 GB model.layers.2, samples: 128, max gpu memory: 8.63 GB model.layers.3, samples: 128, max gpu memory: 8.63 GB model.layers.4, samples: 128, max gpu memory: 8.63 GB model.layers.5, samples: 128, max gpu memory: 8.63 GB model.layers.6, samples: 128, max gpu memory: 8.63 GB model.layers.7, samples: 128, max gpu memory: 8.63 GB model.layers.8, samples: 128, max gpu memory: 8.63 GB model.layers.9, samples: 128, max gpu memory: 8.63 GB model.layers.10, samples: 128, max gpu memory: 8.63 GB model.layers.11, samples: 128, max gpu memory: 8.63 GB model.layers.12, samples: 128, max gpu memory: 8.63 GB model.layers.13, samples: 128, max gpu memory: 8.63 GB model.layers.14, samples: 128, max gpu memory: 8.63 GB model.layers.15, samples: 128, max gpu memory: 8.63 GB model.layers.16, samples: 128, max gpu memory: 8.63 GB model.layers.17, samples: 128, max gpu memory: 8.63 GB model.layers.18, samples: 128, max gpu memory: 8.63 GB model.layers.19, samples: 128, max gpu memory: 8.63 GB model.layers.20, samples: 128, max gpu memory: 8.63 GB model.layers.21, samples: 128, max gpu memory: 8.63 GB model.layers.22, samples: 128, max gpu memory: 8.63 GB model.layers.23, samples: 128, max gpu memory: 8.63 GB model.layers.24, samples: 128, max gpu memory: 8.63 GB model.layers.25, samples: 128, max gpu memory: 8.63 GB model.layers.26, samples: 128, max gpu memory: 8.63 GB model.layers.27, samples: 128, max gpu memory: 8.63 GB model.layers.28, samples: 128, max gpu memory: 8.63 GB model.layers.29, samples: 128, max gpu memory: 8.63 GB model.layers.30, samples: 128, max gpu memory: 8.63 GB model.layers.31, samples: 128, max gpu memory: 8.63 GB
model.layers.0 smooth weight done. model.layers.1 smooth weight done. model.layers.2 smooth weight done. model.layers.3 smooth weight done. model.layers.4 smooth weight done. model.layers.5 smooth weight done. model.layers.6 smooth weight done. model.layers.7 smooth weight done. model.layers.8 smooth weight done. model.layers.9 smooth weight done. model.layers.10 smooth weight done. model.layers.11 smooth weight done. model.layers.12 smooth weight done. model.layers.13 smooth weight done. model.layers.14 smooth weight done. model.layers.15 smooth weight done. model.layers.16 smooth weight done. model.layers.17 smooth weight done. model.layers.18 smooth weight done. model.layers.19 smooth weight done. model.layers.20 smooth weight done. model.layers.21 smooth weight done. model.layers.22 smooth weight done. model.layers.23 smooth weight done. model.layers.24 smooth weight done. model.layers.25 smooth weight done. model.layers.26 smooth weight done. model.layers.27 smooth weight done. model.layers.28 smooth weight done. model.layers.29 smooth weight done. model.layers.30 smooth weight done. model.layers.31 smooth weight done.
However, when I run the inference code, the quantized model cannot be loaded, as shown below:
(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy/vllm_test$ python test_inference_smooth_quant_llama.py
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Traceback (most recent call last):
File "test_inference_smooth_quant_llama.py", line 3, in <module>
pipe = lmdeploy.pipeline(model_path="/mnt/self-define/sunning/lmdeploy/lm-deploy-042-smooth-quant-llama2-7b-hf", model_name='base')
File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/api.py", line 94, in pipeline
return pipeline_class(model_path,
File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/serve/async_engine.py", line 206, in __init__
self._build_turbomind(model_path=model_path,
File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/serve/async_engine.py", line 253, in _build_turbomind
self.engine = tm.TurboMind.from_pretrained(
File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 387, in from_pretrained
return cls(model_path=pretrained_model_name_or_path,
File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 161, in __init__
self.model_comm = self._from_hf(model_source=model_source,
File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 270, in _from_hf
output_model = OUTPUT_MODELS.get(output_format)(
File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/w4.py", line 80, in __init__
super().__init__(input_model, cfg, to_file, out_dir)
File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 156, in __init__
self.cfg = self.get_config(cfg)
File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/w4.py", line 92, in get_config
w1s, _, _ = bin.ffn_scale(i)
File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/source_model/llama_awq.py", line 52, in ffn_scale
return ensure_fp16orint32(self._ffn(i, 'scales'))
File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/source_model/llama.py", line 99, in _ffn
tensor = self.params[
KeyError: 'model.layers.0.mlp.gate_proj.scales'
The contents of test_inference_smooth_quant_llama.py are as follows:
import lmdeploy
pipe = lmdeploy.pipeline(model_path="/mnt/self-define/sunning/lmdeploy/lm-deploy-042-smooth-quant-llama2-7b-hf", model_name='base')
response = pipe(["中国的首都是", "Shanghai is"], do_preprocess=False)
print(response)
Try the code below. smooth_quant produces a w8a8 model, which runs on the PyTorch engine; without a backend_config the pipeline picked the TurboMind backend, which tried to load the folder as an AWQ (w4a16) model and failed with the KeyError above, so pass backend_config=PytorchEngineConfig().
import lmdeploy
from lmdeploy.messages import PytorchEngineConfig
pipe = lmdeploy.pipeline(model_path="/mnt/self-define/sunning/lmdeploy/lm-deploy-042-smooth-quant-llama2-7b-hf", backend_config=PytorchEngineConfig(), model_name='base')
response = pipe(["中国的首都是", "Shanghai is"], do_preprocess=False)
print(response)
Please take a look: why can't the Qwen-7B-Chat model be quantized with SmoothQuant?
That is not supported yet.
If there is nothing further in this issue, I will close it. If you still have questions, you can reopen it.
There is a related development document recently. I am using lmdeploy to export OPT models; the related fusion is not performed during export, but the accuracy of the quantized OPT model drops a lot. Could I communicate with you directly?
@yanchenmochen how about asking in the WeChat group? You can also try --search-scale.
@AllentDan I am not in the WeChat group; do you have WeChat?
https://cdn.vansin.top/internlm/lmdeploy.jpg You can ask in the group; the developers are all there.
@AllentDan Thanks, I have joined the group and posted my question.
Checklist
Describe the bug
Reproduction
I quantized a model with lmdeploy 0.4.2 via smooth_quant, but the quantized model cannot be used and does not work.
Environment
Error traceback
No response