Open Vincent131499 opened 1 week ago
@Vincent131499 If you want to just use Qwen2-72B-Instruct-AWQ, you may try https://huggingface.co/Qwen/Qwen2-72B-Instruct-AWQ This is out of the box :P
Related to #1786
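Loading the official AWQ checkpoint directly works, but tensor parallelism (tp) does not. With tp=1 inference runs normally; with tp≥2 the conversion to turbomind fails. I checked the same checkpoint with vLLM, and multi-GPU inference works there. Could the AWQ quantization bug and the AWQ multi-GPU inference bug be prioritized? The Qwen2 series has accumulated quite a few bugs of this kind. Thanks! @lvhan028 @AllentDan @zhyncs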
Sorry, the team is currently rushing to finish the June release. We will handle this issue in July.
OK, I'll wait for your update.
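The main issue is the intermediate (hidden) dimension of the MLP layers. In the officially quantized version this dimension is 29696, and 29696 ÷ 128 = 232, which is divisible by 8, so it should support up to tp=8. This matters because group_size is usually 128. The original fp16 model has 29568, and 29568 ÷ 128 = 231, which is odd and therefore not divisible by 2 or any multiple of 2. You could try a group_size of 32.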
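The arithmetic in this comment can be checked with a short sketch. It assumes, as the comment suggests but the thread does not confirm for turbomind, that a tp degree is usable only when the number of quantization groups (intermediate size ÷ group_size) is divisible by it. `usable_tp` is a hypothetical helper, not an lmdeploy API.

```python
def usable_tp(intermediate_size, group_size, candidates=(1, 2, 4, 8)):
    """Return the tp degrees that evenly divide the number of quantization groups.

    Assumption (taken from the comment, not verified against turbomind):
    the intermediate dimension is cut into groups of `group_size`, and whole
    groups must be spread evenly across the tp ranks.
    """
    if intermediate_size % group_size != 0:
        return []
    groups = intermediate_size // group_size
    return [tp for tp in candidates if groups % tp == 0]


official_awq = usable_tp(29696, 128)    # 232 groups
original_fp16 = usable_tp(29568, 128)   # 231 groups (odd)
smaller_groups = usable_tp(29568, 32)   # 924 groups
print(official_awq, original_fp16, smaller_groups)
```

Under this assumption, a group_size of 32 would allow tp up to 4 on the original 29568-wide model, but not tp=8, since 924 ÷ 8 is not an integer.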
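Converting the original fp16 model yourself into a 29696-wide version (by zero-padding; I'm not sure whether this affects quality) resolves the problem of tp not working after you quantize the model yourself.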
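The zero-padding idea can be illustrated on a toy gated MLP. This is a minimal sketch in pure Python, assuming the block has the form down(silu(gate(x)) * up(x)) with no bias terms (an assumption about the architecture, not something stated in the thread). All function names are hypothetical. It only shows that padding leaves the unquantized output unchanged; it says nothing about how the padded zeros interact with AWQ calibration.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def matvec(weight, x):
    # weight is a list of rows; each row has len(x) entries
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def gated_mlp(gate, up, down, x):
    g = matvec(gate, x)
    u = matvec(up, x)
    hidden = [silu(a) * b for a, b in zip(g, u)]
    return matvec(down, hidden)


def pad_intermediate(gate, up, down, new_size):
    """Zero-pad the intermediate dimension from len(gate) to new_size."""
    extra = new_size - len(gate)
    in_features = len(gate[0])
    gate_p = gate + [[0.0] * in_features for _ in range(extra)]  # extra output rows
    up_p = up + [[0.0] * in_features for _ in range(extra)]      # extra output rows
    down_p = [row + [0.0] * extra for row in down]               # extra input columns
    return gate_p, up_p, down_p


# Toy sizes: 2 input features, intermediate size 3 padded to 4.
gate = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]
up = [[1.0, 0.5], [-0.5, 1.0], [0.25, -2.0]]
down = [[0.1, 0.2, 0.3], [-0.4, 0.5, -0.6]]
x = [1.0, -2.0]

gate_p, up_p, down_p = pad_intermediate(gate, up, down, new_size=4)
y_ref = gated_mlp(gate, up, down, x)
y_pad = gated_mlp(gate_p, up_p, down_p, x)
print(y_ref, y_pad)
```

The padded rows produce silu(0) * 0 = 0, and the padded columns of the down projection multiply those zeros, so the two outputs agree.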
Same question here.
@Vincent131499 Please post the output of lmdeploy check_env so we can see which environment reproduces this.
@lvhan028 Docker image pulled: openmmlab/lmdeploy:v0.4.2. Server: 8x A100. The quantization commands were run inside the Docker container.
sys.platform: linux
Python: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0+cu118
PyTorch compiling details: PyTorch built with:
TorchVision: 0.16.0+cu118
LMDeploy: 0.4.2+6bb0b37
transformers: 4.41.1
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.1.0
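I hit the same problem with AutoAWQ. The root cause is that the Qwen2-72B-instruct weights contain values equal to 0., and dividing by 0. when computing the scale yields NaN. For now this can be worked around by clamping to a minimum value, but the resulting quantization is biased and may have accuracy problems. At https://github.com/InternLM/lmdeploy/blob/a5aeee34142fef6a12fba53c09889d4a293572d0/lmdeploy/lite/quantization/calibration.py#L380 the denominator inside the parentheses needs to be clamped.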
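The failure mode described here can be sketched in pure Python. This is a simplified illustration modeled loosely on an AWQ-style scale search (activation statistic raised to `ratio`, divided by a weight statistic raised to `1 - ratio`, then normalized). It is not the lmdeploy or AutoAWQ code, and `ieee_div`, `search_scales`, and `denom_min` are hypothetical names. Plain Python raises on division by zero, so `ieee_div` mimics the inf/NaN behavior of tensor libraries.

```python
import math


def ieee_div(a, b):
    """Divide like IEEE-754 floats in a tensor library instead of raising."""
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a)
    return a / b


def search_scales(x_stat, w_stat, ratio, denom_min=None):
    """Per-channel scales s = x^ratio / w^(1 - ratio), normalized by sqrt(max * min).

    If denom_min is given, the denominator is clamped to at least that value
    (the workaround discussed in the thread, in simplified form).
    """
    scales = []
    for x, w in zip(x_stat, w_stat):
        denom = w ** (1.0 - ratio)
        if denom_min is not None:
            denom = max(denom, denom_min)
        scales.append(ieee_div(x ** ratio, denom))
    norm = math.sqrt(max(scales) * min(scales))
    return [ieee_div(s, norm) for s in scales]


x_stat = [4.0, 9.0]
w_stat = [1.0, 0.0]  # second channel has an all-zero weight statistic

naive = search_scales(x_stat, w_stat, ratio=0.5)
clamped = search_scales(x_stat, w_stat, ratio=0.5, denom_min=1e-4)
print(naive, clamped)
```

In the unclamped run the zero denominator produces an infinite scale, and the normalization then turns it into NaN. With the clamp every scale stays finite, but the scale of the zero channel is set by the arbitrary clamp value rather than by the data, which is one way to read the "biased" caveat above.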
I tried https://github.com/InternLM/lmdeploy/pull/1844 (#1844) to quantize the 72B model. After quantization, the model chats normally.
@AllentDan What quantization command did you use with https://github.com/InternLM/lmdeploy/pull/1844? Does quantization run normally now, without the NaN bug?
Apply the PR's changes to your local code and you can run quantization directly; no NaN occurs.
@AllentDan What quantization command did you use? I'd like to reproduce it.
The default one.
@AllentDan With tp 4 it still fails on my side:
lmdeploy serve api_server ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy/ --backend turbomind --model-format awq --quant-policy 8 --tp 4
Convert to turbomind format: 0%| | 0/80 [00:00<?, ?it/s]Traceback (most recent call last):
File "/opt/py38/bin/lmdeploy", line 8, in
I didn't say anything about tp support; I said quantization no longer reports NaN.
Checklist
Describe the bug
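Quantizing the qwen2-72b-instruct model with AWQ fails with the error below under every setting I tried. Please check this as soon as possible. Is there something wrong with how I'm using it?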
Move model.norm to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
Using the latest cached version of the module from /hpc_stor01/home/guangfeng.liu/.cache/huggingface/modules/datasets_modules/datasets/ptb_text_only/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f (last modified on Fri Jun 21 07:15:40 2024) since it couldn't be found locally at ptb_text_only, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /hpc_stor01/home/guangfeng.liu/.cache/huggingface/modules/datasets_modules/datasets/ptb_text_only/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f (last modified on Fri Jun 21 07:15:40 2024) since it couldn't be found locally at ptb_text_only, or remotely on the Hugging Face Hub.
Token indices sequence length is longer than the specified maximum sequence length for this model (1104485 > 131072). Running this sequence through the model will result in indexing errors
model.layers.0, samples: 8, max gpu memory: 8.22 GB
Traceback (most recent call last):
File "/opt/py38/bin/lmdeploy", line 8, in
sys.exit(run())
File "/opt/py38/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
args.run(args)
File "/opt/py38/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 137, in auto_awq
auto_awq(**kwargs)
File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/apis/auto_awq.py", line 96, in auto_awq
vl_model, model, tokenizer, work_dir = calibrate(model,
File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/apis/calibrate.py", line 235, in calibrate
calib_ctx.calibrate(alldata)
File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 315, in calibrate
_ = model(data.to(self.device))
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/py38/lib/python3.8/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1034, in forward
layer_outputs = decoder_layer(
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 505, in _forward
auto_scale_block(mod, batch_kwargs[i], self.w_bits,
File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 416, in auto_scale_block
_auto_get_scale(
File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 400, in _auto_get_scale
best_ratio = _search_module_scale(module2inspect, layers, inp.value,
File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 375, in _search_module_scale
fc.weight.data = pseudo_quantize_tensor(
File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 289, in pseudo_quantize_tensor
assert torch.isnan(scales).sum() == 0
AssertionError
Reproduction
I tried the following three configurations; all of them fail:
1. lmdeploy lite auto_awq ../pretrained-models/qwen2-72b-instruct/ --calib-dataset 'ptb' --calib-samples 128 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --work-dir ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy-new2/ --batch-size 1
2. lmdeploy lite auto_awq ../pretrained-models/qwen2-72b-instruct/ --calib-dataset 'ptb' --calib-samples 32 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --work-dir ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy-new2/ --batch-size 1
3. lmdeploy lite auto_awq ../pretrained-models/qwen2-72b-instruct/ --calib-dataset 'ptb' --calib-samples 32 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --work-dir ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy-new2/ --batch-size 1 --search-scale True
Environment
Error traceback
No response