InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] awq for Qwen2-72B-instruct #1826

Open Vincent131499 opened 1 week ago

Vincent131499 commented 1 week ago

Checklist

Describe the bug

When quantizing the qwen2-72b-instruct model with AWQ, every setting I tried hits the error below. Please check this as soon as possible. Is there something wrong with how I am using it?

```
Move model.norm to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
Using the latest cached version of the module from /hpc_stor01/home/guangfeng.liu/.cache/huggingface/modules/datasets_modules/datasets/ptb_text_only/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f (last modified on Fri Jun 21 07:15:40 2024) since it couldn't be found locally at ptb_text_only, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /hpc_stor01/home/guangfeng.liu/.cache/huggingface/modules/datasets_modules/datasets/ptb_text_only/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f (last modified on Fri Jun 21 07:15:40 2024) since it couldn't be found locally at ptb_text_only, or remotely on the Hugging Face Hub.
Token indices sequence length is longer than the specified maximum sequence length for this model (1104485 > 131072). Running this sequence through the model will result in indexing errors
model.layers.0, samples: 8, max gpu memory: 8.22 GB
Traceback (most recent call last):
  File "/opt/py38/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 137, in auto_awq
    auto_awq(**kwargs)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/apis/auto_awq.py", line 96, in auto_awq
    vl_model, model, tokenizer, work_dir = calibrate(model,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/apis/calibrate.py", line 235, in calibrate
    calib_ctx.calibrate(alldata)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 315, in calibrate
    _ = model(data.to(self.device))
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1034, in forward
    layer_outputs = decoder_layer(
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 505, in _forward
    auto_scale_block(mod, batch_kwargs[i], self.w_bits,
  File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 416, in auto_scale_block
    _auto_get_scale(
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 400, in _auto_get_scale
    best_ratio = _search_module_scale(module2inspect, layers, inp.value,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 375, in _search_module_scale
    fc.weight.data = pseudo_quantize_tensor(
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 289, in pseudo_quantize_tensor
    assert torch.isnan(scales).sum() == 0
AssertionError
```

Reproduction

I tried the following three configurations; all of them error out:

1. lmdeploy lite auto_awq ../pretrained-models/qwen2-72b-instruct/ --calib-dataset 'ptb' --calib-samples 128 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --work-dir ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy-new2/ --batch-size 1
2. lmdeploy lite auto_awq ../pretrained-models/qwen2-72b-instruct/ --calib-dataset 'ptb' --calib-samples 32 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --work-dir ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy-new2/ --batch-size 1
3. lmdeploy lite auto_awq ../pretrained-models/qwen2-72b-instruct/ --calib-dataset 'ptb' --calib-samples 32 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --work-dir ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy-new2/ --batch-size 1 --search-scale True

Environment

All of the above were run with the docker v0.4.2 image.

Error traceback

No response

zhyncs commented 1 week ago

@Vincent131499 If you just want to use Qwen2-72B-Instruct-AWQ, you may try https://huggingface.co/Qwen/Qwen2-72B-Instruct-AWQ. It works out of the box :P

serser commented 1 week ago

Related to #1786

Vincent131499 commented 1 week ago

@Vincent131499 If you just want to use Qwen2-72B-Instruct-AWQ, you may try https://huggingface.co/Qwen/Qwen2-72B-Instruct-AWQ. It works out of the box :P

Loading the official AWQ checkpoint directly does work, but tensor parallelism is broken: tp=1 inference is fine, while tp≥2 fails when converting to turbomind. I verified with vLLM that multi-GPU inference works fine there. Could you prioritize this AWQ quantization bug and the AWQ multi-GPU inference bug? The Qwen2 series has already accumulated quite a few bugs of this kind. Thanks! @lvhan028 @AllentDan @zhyncs

lvhan028 commented 1 week ago

Sorry, the team is busy with the June release right now. We will look into this issue in July.

Vincent131499 commented 1 week ago

Sorry, the team is busy with the June release right now. We will look into this issue in July.

OK, I will wait for your update.

Xu-Chen commented 1 week ago

@Vincent131499 If you just want to use Qwen2-72B-Instruct-AWQ, you may try https://huggingface.co/Qwen/Qwen2-72B-Instruct-AWQ. It works out of the box :P

Loading the official AWQ checkpoint directly does work, but tensor parallelism is broken: tp=1 inference is fine, while tp≥2 fails when converting to turbomind. I verified with vLLM that multi-GPU inference works fine there. Could you prioritize this AWQ quantization bug and the AWQ multi-GPU inference bug? The Qwen2 series has already accumulated quite a few bugs of this kind. Thanks! @lvhan028 @AllentDan @zhyncs

The root issue is the intermediate hidden dimension. Notice that the officially quantized checkpoint uses 29696: 29696 ÷ 128 = 232, and 232 is divisible by 8, so it should support up to tp8 (group_size is typically 128). The original fp16 model uses 29568: 29568 ÷ 128 = 231, which is odd and cannot be divided by any power of 2. You could try a group_size of 32 instead.
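For reference, a small illustrative snippet of the divisibility constraint described above, assuming turbomind splits the per-group quantization parameters along the group dimension, as the assertion in the tp=4 traceback later in this thread suggests:

```python
# Illustrative check: with a given group_size the quantized tensors carry
# intermediate_size / group_size groups, and splitting them across tp GPUs
# requires that group count to be divisible by tp.
def usable_tp(intermediate_size, group_size=128):
    groups = intermediate_size // group_size
    return [tp for tp in (1, 2, 4, 8) if groups % tp == 0]

print(usable_tp(29696))                 # official AWQ checkpoint: 232 groups -> [1, 2, 4, 8]
print(usable_tp(29568))                 # original fp16 model:     231 groups -> [1]
print(usable_tp(29568, group_size=32))  # 924 groups -> [1, 2, 4]
```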

Converting the original fp16 model to a 29696-wide version yourself (by zero-padding; I am not sure whether it affects quality) works around the inability to use tp after quantizing it yourself.
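A minimal sketch of that zero-padding workaround, assuming the standard transformers Qwen2 module layout (gate_proj, up_proj and down_proj inside each layer's mlp); the paths are placeholders, loading a 72B model this way needs a lot of host RAM, and as noted above the effect on quality is unverified:

```python
# Hypothetical sketch: zero-pad the Qwen2 MLP intermediate dimension from
# 29568 to 29696 so that 29696 / 128 = 232 groups, which is divisible by 8.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "../pretrained-models/qwen2-72b-instruct/"         # placeholder path
dst = "../pretrained-models/qwen2-72b-instruct-padded/"  # placeholder path
new_inter = 29696

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float16)
pad = new_inter - model.config.intermediate_size          # 29696 - 29568 = 128

for layer in model.model.layers:
    mlp = layer.mlp
    # gate_proj / up_proj weights are [intermediate, hidden]: append zero rows.
    # The padded channels produce silu(0) * 0 = 0, so the MLP output is unchanged.
    for name in ("gate_proj", "up_proj"):
        proj = getattr(mlp, name)
        w = proj.weight.data
        proj.weight.data = torch.cat([w, w.new_zeros(pad, w.shape[1])], dim=0)
        proj.out_features = new_inter
    # down_proj weight is [hidden, intermediate]: append zero columns that
    # only ever multiply the zero activations added above.
    w = mlp.down_proj.weight.data
    mlp.down_proj.weight.data = torch.cat([w, w.new_zeros(w.shape[0], pad)], dim=1)
    mlp.down_proj.in_features = new_inter

model.config.intermediate_size = new_inter
model.save_pretrained(dst)
AutoTokenizer.from_pretrained(src).save_pretrained(dst)
```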

Tendo33 commented 5 days ago

Same question here.

lvhan028 commented 4 days ago

@Vincent131499 Please post the output of lmdeploy check_env so we can see which environment reproduces this.

Vincent131499 commented 4 days ago

@Vincent131499 Please post the output of lmdeploy check_env so we can see which environment reproduces this.

@lvhan028 Pulled docker image: openmmlab/lmdeploy:v0.4.2. Server: 8x A100. The quantization commands were run inside the docker container.

```
sys.platform: linux
Python: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0+cu118
PyTorch compiling details: PyTorch built with:
TorchVision: 0.16.0+cu118
LMDeploy: 0.4.2+6bb0b37
transformers: 4.41.1
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.1.0
```

AllentDan commented 4 days ago

I hit this problem with AutoAWQ as well. The root cause is that the Qwen2-72B-Instruct weights contain exact 0.0 values, and dividing by 0.0 when computing the scale yields NaN. For now it can be worked around by clamping to a minimum value, but the resulting quantization is biased and may cost some accuracy. https://github.com/InternLM/lmdeploy/blob/a5aeee34142fef6a12fba53c09889d4a293572d0/lmdeploy/lite/quantization/calibration.py#L380 The denominator inside the parentheses here needs to be clamped.
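For illustration only (this is not the actual lmdeploy code), an AWQ-style per-group fake-quantization sketch showing where such a clamp goes; an all-zero weight group makes max minus min equal to 0, so the scales become 0 and the later divisions by the scales produce NaN, which is the assertion that fails in the traceback at the top of this issue:

```python
import torch

def pseudo_quantize_per_group(w, n_bits=4, group_size=128):
    """Asymmetric per-group fake quantization, AWQ-style (illustrative)."""
    shape = w.shape
    w = w.reshape(-1, group_size)
    max_val = w.amax(dim=1, keepdim=True)
    min_val = w.amin(dim=1, keepdim=True)
    max_int = 2 ** n_bits - 1
    # Without the clamp, an all-zero group gives max_val - min_val == 0,
    # so scales == 0 and the divisions by scales below produce NaN.
    scales = (max_val - min_val).clamp(min=1e-5) / max_int
    zeros = (-min_val / scales).round()
    w_q = (torch.clamp((w / scales).round() + zeros, 0, max_int) - zeros) * scales
    assert torch.isnan(scales).sum() == 0
    return w_q.reshape(shape)
```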

AllentDan commented 4 days ago

I tried quantizing the 72B model with https://github.com/InternLM/lmdeploy/pull/1844 and the quantized model chats normally.

Vincent131499 commented 3 days ago

I tried quantizing the 72B model with #1844 and the quantized model chats normally.

@AllentDan What quantization command did you use with https://github.com/InternLM/lmdeploy/pull/1844? Does quantization now run normally, without the NaN bug?

AllentDan commented 3 days ago

Apply the changes from the PR to your local code and quantization runs fine, no NaN.

Vincent131499 commented 3 days ago

Apply the changes from the PR to your local code and quantization runs fine, no NaN.

@AllentDan What quantization command did you run? I would like to reproduce it.

AllentDan commented 3 days ago

The default one.

Vincent131499 commented 3 days ago

@AllentDan On my side tp=4 still fails: lmdeploy serve api_server ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy/ --backend turbomind --model-format awq --quant-policy 8 --tp 4

```
Convert to turbomind format:   0%|          | 0/80 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/py38/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/cli/serve.py", line 303, in api_server
    run_api_server(args.model_path,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/serve/openai/api_server.py", line 1197, in serve
    VariableInterface.async_engine = pipeline_class(
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/serve/async_engine.py", line 200, in __init__
    self._build_turbomind(model_path=model_path,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/serve/async_engine.py", line 247, in _build_turbomind
    self.engine = tm.TurboMind.from_pretrained(
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 344, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 146, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 259, in _from_hf
    output_model.export()
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 283, in export
    self.export_transformer_block(bin, i)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/w4.py", line 156, in export_transformer_block
    self.save_split(w2_sz, f'layers.{i}.feed_forward.w2.scales_zeros', 0)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 256, in save_split
    assert tensor.shape[split_dim] % tp == 0
AssertionError
```

AllentDan commented 3 days ago

I did not say tp is supported; I said quantization no longer hits the NaN.