InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] smooth_quant fails to quantize the Baichuan2-7B-Chat model #1831

Open · CodexDive opened this issue 6 days ago

CodexDive commented 6 days ago

Describe the bug

(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ lmdeploy lite smooth_quant Baichuan2-7B-Chat/ --work-dir lmdeploy-042-smooth-quant-Baichuan2-7B-Chat
Move model.embed_tokens to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.layers.24 to CPU.
Move model.layers.25 to CPU.
Move model.layers.26 to CPU.
Move model.layers.27 to CPU.
Move model.layers.28 to CPU.
Move model.layers.29 to CPU.
Move model.layers.30 to CPU.
Move model.layers.31 to CPU.
Move model.norm to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/datasets/load.py:1429: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument trust_remote_code=True. Passing trust_remote_code=True will be mandatory to load this dataset from the next major release of datasets.
  warnings.warn(
Token indices sequence length is longer than the specified maximum sequence length for this model (1138791 > 4096). Running this sequence through the model will result in indexing errors
model.layers.0, samples: 128, max gpu memory: 8.35 GB
model.layers.1, samples: 128, max gpu memory: 10.35 GB
model.layers.2, samples: 128, max gpu memory: 10.36 GB
model.layers.3, samples: 128, max gpu memory: 10.36 GB
model.layers.4, samples: 128, max gpu memory: 10.37 GB
model.layers.5, samples: 128, max gpu memory: 10.37 GB
model.layers.6, samples: 128, max gpu memory: 10.37 GB
model.layers.7, samples: 128, max gpu memory: 10.38 GB
model.layers.8, samples: 128, max gpu memory: 10.38 GB
model.layers.9, samples: 128, max gpu memory: 10.38 GB
model.layers.10, samples: 128, max gpu memory: 10.39 GB
model.layers.11, samples: 128, max gpu memory: 10.39 GB
model.layers.12, samples: 128, max gpu memory: 10.40 GB
model.layers.13, samples: 128, max gpu memory: 10.40 GB
model.layers.14, samples: 128, max gpu memory: 10.40 GB
model.layers.15, samples: 128, max gpu memory: 10.41 GB
model.layers.16, samples: 128, max gpu memory: 10.41 GB
model.layers.17, samples: 128, max gpu memory: 10.42 GB
model.layers.18, samples: 128, max gpu memory: 10.42 GB
model.layers.19, samples: 128, max gpu memory: 10.42 GB
model.layers.20, samples: 128, max gpu memory: 10.43 GB
model.layers.21, samples: 128, max gpu memory: 10.43 GB
model.layers.22, samples: 128, max gpu memory: 10.44 GB
model.layers.23, samples: 128, max gpu memory: 10.44 GB
model.layers.24, samples: 128, max gpu memory: 10.44 GB
model.layers.25, samples: 128, max gpu memory: 10.45 GB
model.layers.26, samples: 128, max gpu memory: 10.45 GB
model.layers.27, samples: 128, max gpu memory: 10.46 GB
model.layers.28, samples: 128, max gpu memory: 10.46 GB
model.layers.29, samples: 128, max gpu memory: 10.46 GB
model.layers.30, samples: 128, max gpu memory: 10.47 GB
model.layers.31, samples: 128, max gpu memory: 10.47 GB
Traceback (most recent call last):
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 158, in smooth_quant
    smooth_quant(**kwargs)
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/lite/apis/smooth_quant.py", line 97, in smooth_quant
    raise RuntimeError(
RuntimeError: Currently, quantification and calibration of BaichuanForCausalLM are not supported. The supported model types are InternLMForCausalLM, InternLM2ForCausalLM, QWenLMHeadModel, BaiChuanForCausalLM, LlamaForCausalLM.

Reproduction

lmdeploy lite smooth_quant Baichuan2-7B-Chat/ --work-dir lmdeploy-042-smooth-quant-Baichuan2-7B-Chat

The calibration pass itself completes normally, but the run then aborts with: RuntimeError: Currently, quantification and calibration of BaichuanForCausalLM are not supported. The supported model types are InternLMForCausalLM, InternLM2ForCausalLM, QWenLMHeadModel, BaiChuanForCausalLM, LlamaForCausalLM.

Environment

(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ lmdeploy check_env
sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda-12.0
NVCC: Cuda compilation tools, release 12.0, V12.0.140
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.2.1+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.7  (built against CUDA 12.2)
    - Built with CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.1+cu118
LMDeploy: 0.4.2+9a00760
transformers: 4.40.2
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.2.0

Error traceback

No response

lvhan028 commented 6 days ago

As the log said, "Currently, quantification and calibration of BaichuanForCausalLM are not supported. The supported model types are InternLMForCausalLM, InternLM2ForCausalLM, QWenLMHeadModel, BaiChuanForCausalLM, LlamaForCausalLM."
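
Judging from the traceback (smooth_quant.py raises at a model-type check), the lite pipeline appears to dispatch on the model's Python class name, so the match is case-sensitive: loading Baichuan2-7B-Chat produces the class BaichuanForCausalLM, which is not the same string as the supported BaiChuanForCausalLM. A minimal sketch of that kind of check, not the actual lmdeploy source:

# Sketch only: a case-sensitive class-name dispatch like the one the
# traceback points at. The list mirrors the error message; the helper
# itself is hypothetical.
SUPPORTED_ARCHS = [
    'InternLMForCausalLM',
    'InternLM2ForCausalLM',
    'QWenLMHeadModel',
    'BaiChuanForCausalLM',  # note the capital 'C' in 'Chuan'
    'LlamaForCausalLM',
]

def check_arch(model):
    arch = type(model).__name__  # 'BaichuanForCausalLM' for Baichuan2
    if arch not in SUPPORTED_ARCHS:
        raise RuntimeError(
            f'Currently, quantification and calibration of {arch} are not '
            f'supported. The supported model types are '
            f'{", ".join(SUPPORTED_ARCHS)}.')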

CodexDive commented 5 days ago

Of the large language models that smoothquant currently supports, InternLMForCausalLM and InternLM2ForCausalLM are InternLM v1 and v2. Which models, then, do the other three quantizable types, QWenLMHeadModel, BaiChuanForCausalLM, and LlamaForCausalLM, correspond to?

CodexDive commented 5 days ago

Doesn't Baichuan2-7B-Chat correspond to BaiChuanForCausalLM?
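
Per the error above, apparently not: Baichuan2-7B-Chat yields BaichuanForCausalLM (lowercase 'c'). One way to see which class a checkpoint declares is to read the architectures field from its config; a small sketch, assuming the local Baichuan2-7B-Chat/ directory used in the command above:

from transformers import AutoConfig

# Baichuan repos ship custom modeling code, hence trust_remote_code=True.
cfg = AutoConfig.from_pretrained('Baichuan2-7B-Chat/', trust_remote_code=True)
print(cfg.architectures)  # Baichuan2 should print ['BaichuanForCausalLM']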

CodexDive commented 5 days ago

Quantizing Baichuan-13B-Chat also fails:

(lmdeploy042) yuzailiang@ubuntu:/mnt/self-define/sunning/lmdeploy$ lmdeploy lite smooth_quant /mnt/self-define/zhangweixing/model/Baichuan-13B-Chat --work-dir lmdeploy-042-smooth-quant-Baichuan-13B-Chat
Traceback (most recent call last):
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 158, in smooth_quant
    smooth_quant(**kwargs)
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/lite/apis/smooth_quant.py", line 79, in smooth_quant
    vl_model, model, tokenizer, work_dir = calibrate(model,
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/lmdeploy/lite/apis/calibrate.py", line 171, in calibrate
    tokenizer = AutoTokenizer.from_pretrained(model,
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 810, in from_pretrained
    return tokenizer_class.from_pretrained(
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/yuzailiang/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/tokenization_baichuan.py", line 55, in __init__
    super().__init__(
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "/home/yuzailiang/anaconda3/envs/lmdeploy042/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
    current_vocab = self.get_vocab().copy()
  File "/home/yuzailiang/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/tokenization_baichuan.py", line 89, in get_vocab
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
  File "/home/yuzailiang/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/tokenization_baichuan.py", line 85, in vocab_size
    return self.sp_model.get_piece_size()
AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'

AllentDan commented 5 days ago

It looks like the capitalization of the class name differs, too. Besides, your Baichuan model definitely won't run with your current transformers either; use the transformers version required by its config.
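
For reference, the AttributeError is a known incompatibility between the remote-code Baichuan tokenizer and newer transformers releases: since roughly transformers 4.34, PreTrainedTokenizer.__init__ calls get_vocab() during construction, before the original tokenization_baichuan.py has assigned self.sp_model. Downgrading transformers as suggested above avoids this; the other common workaround is to patch the tokenizer so the SentencePiece model is loaded first. An abridged sketch of that reordering (the real file takes more arguments):

import sentencepiece as spm
from transformers import PreTrainedTokenizer

class BaichuanTokenizer(PreTrainedTokenizer):
    """Abridged sketch; the real tokenization_baichuan.py has more arguments."""

    def __init__(self, vocab_file, **kwargs):
        self.vocab_file = vocab_file
        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(vocab_file)  # assign sp_model BEFORE super().__init__ ...
        super().__init__(**kwargs)      # ... which may call get_vocab() right away

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size()

    def get_vocab(self):
        return {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}

    def _convert_id_to_token(self, index):
        return self.sp_model.IdToPiece(index)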

CodexDive commented 3 days ago

The Baichuan2 models are probably just not supported for quantization at the moment. With smoothquant, which of the Qwen and Baichuan models does 0.4.2 actually support quantizing?