InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Qwen-7B-Chat quantization fails with AttributeError: 'RMSNorm' object has no attribute 'variance_epsilon' #1830

Open CodexDive opened 6 days ago

CodexDive commented 6 days ago

Describe the bug

Currently, lmdeploy's smooth_quant only supports the following model architectures:

LAYER_TYPE_MAP = {
    'InternLMForCausalLM': 'InternLMDecoderLayer',
    'InternLM2ForCausalLM': 'InternLM2DecoderLayer',
    'QWenLMHeadModel': 'QWenBlock',
    'BaiChuanForCausalLM': 'DecoderLayer',
    'LlamaForCausalLM': 'LlamaDecoderLayer',
}

I tried to quantize Qwen-7B-Chat myself and hit the following error:

(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ lmdeploy lite smooth_quant Qwen-7B-Chat/ --work-dir lmdeploy-042-smooth-quant-Qwen-7B-Chat
Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  3.74it/s]
Move transformer.wte to GPU.
Move transformer.drop to GPU.
Move transformer.rotary_emb to GPU.
Move transformer.h.0 to CPU.
Move transformer.h.1 to CPU.
Move transformer.h.2 to CPU.
Move transformer.h.3 to CPU.
Move transformer.h.4 to CPU.
Move transformer.h.5 to CPU.
Move transformer.h.6 to CPU.
Move transformer.h.7 to CPU.
Move transformer.h.8 to CPU.
Move transformer.h.9 to CPU.
Move transformer.h.10 to CPU.
Move transformer.h.11 to CPU.
Move transformer.h.12 to CPU.
Move transformer.h.13 to CPU.
Move transformer.h.14 to CPU.
Move transformer.h.15 to CPU.
Move transformer.h.16 to CPU.
Move transformer.h.17 to CPU.
Move transformer.h.18 to CPU.
Move transformer.h.19 to CPU.
Move transformer.h.20 to CPU.
Move transformer.h.21 to CPU.
Move transformer.h.22 to CPU.
Move transformer.h.23 to CPU.
Move transformer.h.24 to CPU.
Move transformer.h.25 to CPU.
Move transformer.h.26 to CPU.
Move transformer.h.27 to CPU.
Move transformer.h.28 to CPU.
Move transformer.h.29 to CPU.
Move transformer.h.30 to CPU.
Move transformer.h.31 to CPU.
Move transformer.ln_f to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/datasets/load.py:1429: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/datasets/load.py:1429: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Token indices sequence length is longer than the specified maximum sequence length for this model (1104485 > 32768). Running this sequence through the model will result in indexing errors
transformer.h.0, samples: 128, max gpu memory: 7.55 GB
transformer.h.1, samples: 128, max gpu memory: 9.55 GB
transformer.h.2, samples: 128, max gpu memory: 9.55 GB
transformer.h.3, samples: 128, max gpu memory: 9.55 GB
transformer.h.4, samples: 128, max gpu memory: 9.55 GB
transformer.h.5, samples: 128, max gpu memory: 9.55 GB
transformer.h.6, samples: 128, max gpu memory: 9.55 GB
transformer.h.7, samples: 128, max gpu memory: 9.55 GB
transformer.h.8, samples: 128, max gpu memory: 9.55 GB
transformer.h.9, samples: 128, max gpu memory: 9.55 GB
transformer.h.10, samples: 128, max gpu memory: 9.55 GB
transformer.h.11, samples: 128, max gpu memory: 9.55 GB
transformer.h.12, samples: 128, max gpu memory: 9.55 GB
transformer.h.13, samples: 128, max gpu memory: 9.55 GB
transformer.h.14, samples: 128, max gpu memory: 9.55 GB
transformer.h.15, samples: 128, max gpu memory: 9.55 GB
transformer.h.16, samples: 128, max gpu memory: 9.55 GB
transformer.h.17, samples: 128, max gpu memory: 9.55 GB
transformer.h.18, samples: 128, max gpu memory: 9.55 GB
transformer.h.19, samples: 128, max gpu memory: 9.55 GB
transformer.h.20, samples: 128, max gpu memory: 9.55 GB
transformer.h.21, samples: 128, max gpu memory: 9.55 GB
transformer.h.22, samples: 128, max gpu memory: 9.55 GB
transformer.h.23, samples: 128, max gpu memory: 9.55 GB
transformer.h.24, samples: 128, max gpu memory: 9.55 GB
transformer.h.25, samples: 128, max gpu memory: 9.55 GB
transformer.h.26, samples: 128, max gpu memory: 9.55 GB
transformer.h.27, samples: 128, max gpu memory: 9.55 GB
transformer.h.28, samples: 128, max gpu memory: 9.55 GB
transformer.h.29, samples: 128, max gpu memory: 9.55 GB
transformer.h.30, samples: 128, max gpu memory: 9.55 GB
transformer.h.31, samples: 128, max gpu memory: 9.55 GB
transformer.h.0 smooth weight done.
transformer.h.1 smooth weight done.
transformer.h.2 smooth weight done.
transformer.h.3 smooth weight done.
transformer.h.4 smooth weight done.
transformer.h.5 smooth weight done.
transformer.h.6 smooth weight done.
transformer.h.7 smooth weight done.
transformer.h.8 smooth weight done.
transformer.h.9 smooth weight done.
transformer.h.10 smooth weight done.
transformer.h.11 smooth weight done.
transformer.h.12 smooth weight done.
transformer.h.13 smooth weight done.
transformer.h.14 smooth weight done.
transformer.h.15 smooth weight done.
transformer.h.16 smooth weight done.
transformer.h.17 smooth weight done.
transformer.h.18 smooth weight done.
transformer.h.19 smooth weight done.
transformer.h.20 smooth weight done.
transformer.h.21 smooth weight done.
transformer.h.22 smooth weight done.
transformer.h.23 smooth weight done.
transformer.h.24 smooth weight done.
transformer.h.25 smooth weight done.
transformer.h.26 smooth weight done.
transformer.h.27 smooth weight done.
transformer.h.28 smooth weight done.
transformer.h.29 smooth weight done.
transformer.h.30 smooth weight done.
transformer.h.31 smooth weight done.
Traceback (most recent call last):
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 158, in smooth_quant
    smooth_quant(**kwargs)
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/lite/apis/smooth_quant.py", line 141, in smooth_quant
    q_norm = QRMSNorm.from_float(norm)
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/pytorch/models/q_modules.py", line 51, in from_float
    eps = mod.variance_epsilon
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'RMSNorm' object has no attribute 'variance_epsilon'
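
The root cause appears to be a naming mismatch: lmdeploy's QRMSNorm.from_float (q_modules.py, line 51 in the traceback) reads mod.variance_epsilon, the attribute name used by transformers' LlamaRMSNorm, while Qwen-7B-Chat's trust_remote_code RMSNorm stores its epsilon as eps. A minimal sketch of the mismatch (the Qwen class below is paraphrased from the public modeling_qwen.py, not imported from it):

import torch
from torch import nn

class QwenRMSNorm(nn.Module):
    """Qwen's remote-code RMSNorm stores its epsilon as `self.eps`."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

norm = QwenRMSNorm(4096)
# lmdeploy reads `variance_epsilon`, which Qwen's class never defines:
print(hasattr(norm, 'variance_epsilon'))  # False -> the AttributeError above
print(norm.eps)                           # 1e-06, the value lmdeploy needs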

Reproduction

lmdeploy lite smooth_quant Qwen-7B-Chat/ --work-dir lmdeploy-042-smooth-quant-Qwen-7B-Chat
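
One possible local workaround, sketched and untested: alias Qwen's eps attribute to the variance_epsilon name that QRMSNorm.from_float expects, then invoke the smooth_quant API from Python (a monkeypatch cannot reach the CLI's separate process). The positional model path and the work_dir keyword are assumed to mirror the CLI arguments, and other Qwen-specific incompatibilities may remain further down the pipeline.

from lmdeploy.lite.apis.smooth_quant import smooth_quant
from lmdeploy.pytorch.models.q_modules import QRMSNorm

_orig_from_float = QRMSNorm.from_float  # bound classmethod

def _patched_from_float(mod):
    # Qwen's RMSNorm names its epsilon `eps`; alias it before conversion.
    if not hasattr(mod, 'variance_epsilon') and hasattr(mod, 'eps'):
        mod.variance_epsilon = mod.eps
    return _orig_from_float(mod)

QRMSNorm.from_float = _patched_from_float

smooth_quant('Qwen-7B-Chat/',
             work_dir='lmdeploy-042-smooth-quant-Qwen-7B-Chat')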

Environment

(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ lmdeploy check_env
sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda-12.0
NVCC: Cuda compilation tools, release 12.0, V12.0.140
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.2.1+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.7  (built against CUDA 12.2)
    - Built with CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.1+cu118
LMDeploy: 0.4.2+9a00760
transformers: 4.40.2
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.2.0

Error traceback

No response

CodexDive commented 3 days ago

What is causing this smooth_quant failure for Qwen-7B-Chat? Is it simply unsupported at the moment? @AllentDan This Qwen release, and the chat variant in particular, is supposed to be supported by lmdeploy 0.4.2.