[Bug] awq for Qwen2-72B-instruct

Vincent131499 commented 1 week ago

Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.

Describe the bug

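I am quantizing the qwen2-72b-instruct model with AWQ. I have tried many different settings, and every one of them fails with the error below. Please check this as soon as possible. Is there something wrong with the way I am using it?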

Move model.norm to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
Using the latest cached version of the module from /hpc_stor01/home/guangfeng.liu/.cache/huggingface/modules/datasets_modules/datasets/ptb_text_only/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f (last modified on Fri Jun 21 07:15:40 2024) since it couldn't be found locally at ptb_text_only, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /hpc_stor01/home/guangfeng.liu/.cache/huggingface/modules/datasets_modules/datasets/ptb_text_only/8d1b97746fb9765d140e569ec5ddd35e20af4d37761f5e1bf357ea0b081f2c1f (last modified on Fri Jun 21 07:15:40 2024) since it couldn't be found locally at ptb_text_only, or remotely on the Hugging Face Hub.
Token indices sequence length is longer than the specified maximum sequence length for this model (1104485 > 131072). Running this sequence through the model will result in indexing errors
model.layers.0, samples: 8, max gpu memory: 8.22 GB
Traceback (most recent call last):
  File "/opt/py38/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 137, in auto_awq
    auto_awq(**kwargs)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/apis/auto_awq.py", line 96, in auto_awq
    vl_model, model, tokenizer, work_dir = calibrate(model,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/apis/calibrate.py", line 235, in calibrate
    calib_ctx.calibrate(alldata)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 315, in calibrate
    _ = model(data.to(self.device))
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1034, in forward
    layer_outputs = decoder_layer(
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 505, in _forward
    auto_scale_block(mod, batch_kwargs[i], self.w_bits,
  File "/opt/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 416, in auto_scale_block
    _auto_get_scale(
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 400, in _auto_get_scale
    best_ratio = _search_module_scale(module2inspect, layers, inp.value,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/calibration.py", line 375, in _search_module_scale
    fc.weight.data = pseudo_quantize_tensor(
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 289, in pseudo_quantize_tensor
    assert torch.isnan(scales).sum() == 0
AssertionError

Reproduction

I tried the following three configurations; all of them fail:
1. lmdeploy lite auto_awq ../pretrained-models/qwen2-72b-instruct/ --calib-dataset 'ptb' --calib-samples 128 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --work-dir ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy-new2/ --batch-size 1
2. lmdeploy lite auto_awq ../pretrained-models/qwen2-72b-instruct/ --calib-dataset 'ptb' --calib-samples 32 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --work-dir ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy-new2/ --batch-size 1
3. lmdeploy lite auto_awq ../pretrained-models/qwen2-72b-instruct/ --calib-dataset 'ptb' --calib-samples 32 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --work-dir ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy-new2/ --batch-size 1 --search-scale True

Environment

All runs were done in the Docker image, version v0.4.2.

Error traceback

No response

zhyncs commented 1 week ago

@Vincent131499 If you want to just use Qwen2-72B-Instruct-AWQ, you may try https://huggingface.co/Qwen/Qwen2-72B-Instruct-AWQ This is out of the box :P

serser commented 1 week ago

Related to #1786

Vincent131499 commented 1 week ago

@Vincent131499 If you want to just use Qwen2-72B-Instruct-AWQ, you may try https://huggingface.co/Qwen/Qwen2-72B-Instruct-AWQ This is out of the box :P

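Loading the official AWQ checkpoint directly works, but there is a problem with tp. With tp=1, inference works normally; with tp≥2, the conversion to turbomind fails. I verified the same checkpoint with vLLM, and multi-GPU inference works there. Could you prioritize fixing the AWQ quantization bug and the AWQ multi-GPU inference bug? I see that quite a few bugs of this kind have piled up for the Qwen2 model series. Thanks! @lvhan028 @AllentDan @zhyncs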

lvhan028 commented 1 week ago

Sorry, the team is currently rushing to finish the June release. We will handle this issue in July.

Vincent131499 commented 1 week ago

Sorry, the team is currently rushing to finish the June release. We will handle this issue in July.

OK, I'll wait for your update.

Xu-Chen commented 1 week ago

@Vincent131499 If you want to just use Qwen2-72B-Instruct-AWQ, you may try https://huggingface.co/Qwen/Qwen2-72B-Instruct-AWQ This is out of the box :P

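Loading the official AWQ checkpoint directly works, but there is a problem with tp. With tp=1, inference works normally; with tp≥2, the conversion to turbomind fails. I verified the same checkpoint with vLLM, and multi-GPU inference works there. Could you prioritize fixing the AWQ quantization bug and the AWQ multi-GPU inference bug? I see that quite a few bugs of this kind have piled up for the Qwen2 model series. Thanks! @lvhan028 @AllentDan @zhyncs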

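The main issue is the intermediate (MLP hidden) dimension. In the officially quantized version this dimension is 29696: 29696 ÷ 128 = 232, and 232 is divisible by 8, so it should support up to tp=8, given that a group_size of 128 is the usual choice. The original fp16 version has 29568: 29568 ÷ 128 = 231, which is odd and therefore cannot be split evenly by any even tp. You could try a group_size of 32.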
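The divisibility argument above can be checked with a few lines of Python. This is only a sketch: it assumes the relevant constraint is that the number of quantization groups along the intermediate dimension (intermediate size ÷ group_size) must be divisible by tp, and `max_tp` is a hypothetical helper, not an lmdeploy function.

```python
def max_tp(intermediate_size: int, group_size: int, candidates=(1, 2, 4, 8)) -> int:
    """Return the largest tp in `candidates` that evenly splits the quantization groups."""
    if intermediate_size % group_size != 0:
        raise ValueError("intermediate_size must be a multiple of group_size")
    num_groups = intermediate_size // group_size
    # tp=1 always divides num_groups, so max() never sees an empty sequence.
    return max(tp for tp in candidates if num_groups % tp == 0)


# (intermediate size, group_size) pairs taken from the comment above.
for size, group in [(29696, 128), (29568, 128), (29568, 32)]:
    print(size, group, size // group, max_tp(size, group))
# 29696 128 232 8
# 29568 128 231 1
# 29568 32 924 4
```

Under this assumption, a group_size of 32 would allow tp up to 4 for the original fp16 layout, but still not tp=8.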

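Converting the original fp16 model yourself into a 29696-wide version (by zero-padding; I am not sure whether this affects quality) solves the problem that a self-quantized model cannot use tp.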
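The zero-padding idea can be illustrated on a toy MLP. The sketch below assumes a SwiGLU-style MLP without biases, i.e. down(silu(gate(x)) * up(x)), uses NumPy with made-up small shapes, and all function names are hypothetical. It is not a conversion script for the real checkpoint.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def mlp(x, w_gate, w_up, w_down):
    # Weights are stored as (out_features, in_features), as in a linear layer.
    return (silu(x @ w_gate.T) * (x @ w_up.T)) @ w_down.T


def pad_intermediate(w_gate, w_up, w_down, new_size):
    """Zero-pad the intermediate dimension of all three projections to `new_size`."""
    extra = new_size - w_gate.shape[0]
    if extra < 0:
        raise ValueError("new_size must not be smaller than the current size")
    w_gate_p = np.pad(w_gate, ((0, extra), (0, 0)))  # extra output rows of zeros
    w_up_p = np.pad(w_up, ((0, extra), (0, 0)))      # extra output rows of zeros
    w_down_p = np.pad(w_down, ((0, 0), (0, extra)))  # extra input columns of zeros
    return w_gate_p, w_up_p, w_down_p


rng = np.random.default_rng(0)
hidden, inter, padded = 8, 12, 16  # toy stand-ins for the real sizes
w_gate = rng.standard_normal((inter, hidden))
w_up = rng.standard_normal((inter, hidden))
w_down = rng.standard_normal((hidden, inter))
x = rng.standard_normal((3, hidden))

before = mlp(x, w_gate, w_up, w_down)
padded_weights = pad_intermediate(w_gate, w_up, w_down, padded)
after = mlp(x, *padded_weights)
print(np.allclose(before, after))
```

In this toy setting each padded channel contributes silu(0) * 0 = 0, so the unquantized output is unchanged. Whether the all-zero padded channels affect the quantized model is a separate question that this sketch does not answer, and a real conversion would also have to update the intermediate size in the model config.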

Tendo33 commented 5 days ago

Same question here.

lvhan028 commented 4 days ago

@Vincent131499 Please post the output of lmdeploy check_env so that we can see in which environment this can be reproduced.

Vincent131499 commented 4 days ago

@Vincent131499 Please post the output of lmdeploy check_env so that we can see in which environment this can be reproduced.

@lvhan028 Docker image pulled: openmmlab/lmdeploy:v0.4.2. Server: 8x A100. Execution: the quantization commands were run inside the Docker container.

sys.platform: linux
Python: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0+cu118
PyTorch compiling details: PyTorch built with:

GCC 9.3
C++ Version: 201703
Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX512
CUDA Runtime 11.8
NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
CuDNN 8.7
Magma 2.6.1
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.16.0+cu118
LMDeploy: 0.4.2+6bb0b37
transformers: 4.41.1
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.1.0

AllentDan commented 4 days ago

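I hit this problem with AutoAWQ as well. The root cause is that the Qwen2-72B-instruct weights contain values equal to 0., and dividing by 0. when computing the scale produces NaN. For now this can be worked around by clamping to a min. However, the quantization obtained this way is biased and may have accuracy problems. https://github.com/InternLM/lmdeploy/blob/a5aeee34142fef6a12fba53c09889d4a293572d0/lmdeploy/lite/quantization/calibration.py#L380 The denominator inside the parentheses here needs to be clamped.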
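The failure mode described above can be reproduced in isolation. The sketch below is not the lmdeploy code: it assumes an AWQ-style scale of the form x^ratio / w^(1 - ratio) followed by a normalization, uses NumPy instead of torch, and `search_scales` and `eps` are hypothetical names.

```python
import numpy as np


def search_scales(x_mean, w_mean, ratio, eps=None):
    """Illustrative AWQ-style scale: x^ratio / w^(1 - ratio), then normalized."""
    denom = w_mean ** (1.0 - ratio)
    if eps is not None:
        # Workaround: clamp the denominator to a small positive minimum.
        denom = np.clip(denom, eps, None)
    with np.errstate(divide="ignore", invalid="ignore"):
        scales = x_mean ** ratio / denom  # a zero denominator gives inf here
        scales = scales / np.sqrt(scales.max() * scales.min())  # inf / inf gives NaN
    return scales


x_mean = np.array([0.5, 1.0, 2.0])  # made-up per-channel activation statistics
w_mean = np.array([0.1, 0.0, 0.3])  # one channel whose weights are all zero

print(np.isnan(search_scales(x_mean, w_mean, ratio=0.5)).any())               # True
print(np.isfinite(search_scales(x_mean, w_mean, ratio=0.5, eps=1e-4)).all())  # True
```

Clamping keeps every scale finite, but the scale of the zero-weight channel is then determined by `eps` rather than by the data, which is one way to read the remark that the result is biased.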

AllentDan commented 4 days ago

I tried quantizing the 72B model with https://github.com/InternLM/lmdeploy/pull/1844. After quantization, the model can chat normally.

Vincent131499 commented 3 days ago

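I tried quantizing the 72B model with #1844. After quantization, the model can chat normally.

@AllentDan Which quantization command did you use with https://github.com/InternLM/lmdeploy/pull/1844? Does quantization work now, without the NaN bug?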

AllentDan commented 3 days ago

Apply the changes from the PR to your local code; quantization then runs directly, without NaN.

Vincent131499 commented 3 days ago

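Apply the changes from the PR to your local code; quantization then runs directly, without NaN.

@AllentDan Which quantization command did you use? I would like to reproduce it.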

AllentDan commented 3 days ago

The default one.

Vincent131499 commented 3 days ago

@AllentDan With tp=4 on my side it still fails:
lmdeploy serve api_server ../pretrained-models/qwen2-72b-instruct-w4-lmdeploy/ --backend turbomind --model-format awq --quant-policy 8 --tp 4

Convert to turbomind format: 0%| | 0/80 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/py38/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/cli/serve.py", line 303, in api_server
    run_api_server(args.model_path,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/serve/openai/api_server.py", line 1197, in serve
    VariableInterface.async_engine = pipeline_class(
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/serve/async_engine.py", line 200, in __init__
    self._build_turbomind(model_path=model_path,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/serve/async_engine.py", line 247, in _build_turbomind
    self.engine = tm.TurboMind.from_pretrained(
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 344, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 146, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 259, in _from_hf
    output_model.export()
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 283, in export
    self.export_transformer_block(bin, i)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/w4.py", line 156, in export_transformer_block
    self.save_split(w2_sz, f'layers.{i}.feed_forward.w2.scales_zeros', 0)
  File "/opt/py38/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 256, in save_split
    assert tensor.shape[split_dim] % tp == 0
AssertionError

AllentDan commented 3 days ago

I did not say anything about tp support. I only said that quantization no longer reports NaN.

InternLM / lmdeploy