InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] InternVL-Chat-V1-5 quantization fails #1660

Closed. BigWhiteFox closed this issue 5 months ago.

BigWhiteFox commented 5 months ago


Describe the bug

Both w4a16 and w8a8 quantization of the InternVL-Chat-V1-5 model fail: the lmdeploy lite tool does not recognize the configuration class of the InternVL-Chat-V1-5 model.

Reproduction

lmdeploy lite smooth_quant /root/models/InternVL-Chat-V1-5 --work-dir /root/models/InternVL-Chat-V1-5-w8a8

lmdeploy lite auto_awq \
  /root/models/InternVL-Chat-V1-5 \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/models/InternVL-Chat-V1-5-w4a16-4bit
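For reference, the failure can be reproduced with transformers alone, since the lite commands load the checkpoint through AutoModelForCausalLM (lmdeploy/lite/utils/load.py, see the traceback below). A minimal Python sketch, assuming the same local checkpoint path; trust_remote_code=True is an assumption here, but the resulting ValueError is the same one shown in the traceback:

from transformers import AutoModelForCausalLM

path = '/root/models/InternVL-Chat-V1-5'
try:
    # Mirrors what lmdeploy lite's load_hf_from_pretrained does: load the checkpoint
    # as a causal LM. InternVLChatConfig is a multimodal config with no mapping to
    # AutoModelForCausalLM, so transformers rejects it.
    AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)
except ValueError as err:
    print(err)  # Unrecognized configuration class ... InternVLChatConfig ...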

Environment

sys.platform: linux
Python: 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.140
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

LMDeploy: 0.4.1+
transformers: 4.41.1
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.2.0

Error traceback

(test) root@intern-studio-40073828:~# lmdeploy lite auto_awq \
>    /root/models/InternVL-Chat-V1-5 \
>   --calib-dataset 'ptb' \
>   --calib-samples 128 \
>   --calib-seqlen 1024 \
>   --w-bits 4 \
>   --w-group-size 128 \
>   --work-dir /root/models/InternVL-Chat-V1-5-w4a16-4bit
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/root/.conda/envs/test/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/root/.conda/envs/test/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/root/.conda/envs/test/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
    auto_awq(**kwargs)
  File "/root/.conda/envs/test/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 55, in auto_awq
    model, tokenizer, work_dir = calibrate(model, calib_dataset, calib_samples,
  File "/root/.conda/envs/test/lib/python3.10/site-packages/lmdeploy/lite/apis/calibrate.py", line 152, in calibrate
    model = load_hf_from_pretrained(model,
  File "/root/.conda/envs/test/lib/python3.10/site-packages/lmdeploy/lite/utils/load.py", line 31, in load_hf_from_pretrained
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/.conda/envs/test/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.InternVL-Chat-V1-5.configuration_internvl_chat.InternVLChatConfig'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

(test) root@intern-studio-40073828:~# lmdeploy lite smooth_quant /root/models/InternVL-Chat-V1-5 --work-dir /root/models/InternVL-Chat-V1-5-w8a8

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/root/.conda/envs/test/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/root/.conda/envs/test/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/root/.conda/envs/test/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 152, in smooth_quant
    smooth_quant(**kwargs)
  File "/root/.conda/envs/test/lib/python3.10/site-packages/lmdeploy/lite/apis/smooth_quant.py", line 74, in smooth_quant
    model, tokenizer, work_dir = calibrate(model, calib_dataset, calib_samples,
  File "/root/.conda/envs/test/lib/python3.10/site-packages/lmdeploy/lite/apis/calibrate.py", line 152, in calibrate
    model = load_hf_from_pretrained(model,
  File "/root/.conda/envs/test/lib/python3.10/site-packages/lmdeploy/lite/utils/load.py", line 31, in load_hf_from_pretrained
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/.conda/envs/test/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.InternVL-Chat-V1-5.configuration_internvl_chat.InternVLChatConfig'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.
AllentDan commented 5 months ago

Please wait for lmdeploy v0.4.2.

BigWhiteFox commented 5 months ago

Please wait for lmdeploy v0.4.2.

lmdeploy serve api_server \
  /root/models/InternVL-Chat-V1-5 \
  --model-format hf \
  --quant-policy 4 \
  --server-name 0.0.0.0 \
  --server-port 23333

Does --quant-policy 4 also have to wait for 0.4.2? I tried it and found that GPU memory usage is the same as without quantization.

AllentDan commented 5 months ago

No need to wait for that one. The usage looks the same because we follow an aggressive GPU memory allocation strategy. You can set the --cache-max-entry-count parameter together with it; if that parameter is left unchanged, then even with KV cache quantization, whatever GPU memory is available will be used. Try setting it to a smaller value, e.g. 0.4.
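For example, the api_server command from above with the additional flag (a sketch; 0.4 is just the value suggested here and can be tuned for your GPU):

lmdeploy serve api_server \
  /root/models/InternVL-Chat-V1-5 \
  --model-format hf \
  --quant-policy 4 \
  --cache-max-entry-count 0.4 \
  --server-name 0.0.0.0 \
  --server-port 23333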

BigWhiteFox commented 5 months ago

No need to wait for that one. The usage looks the same because we follow an aggressive GPU memory allocation strategy. You can set the --cache-max-entry-count parameter together with it; if that parameter is left unchanged, then even with KV cache quantization, whatever GPU memory is available will be used. Try setting it to a smaller value, e.g. 0.4.

Thanks for the explanation. After testing, adding the --cache-max-entry-count setting achieves the desired result.