InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Quantization with the default parameters produces a model that runs inference normally; with --search-scale True --batch-size 8 set, the quantized model fails at inference #1883

Closed AIFFFENG closed 2 days ago

AIFFFENG commented 5 days ago


Describe the bug

Traceback (most recent call last):
  File "/data/54T/多模态大模型加速/lmdeploy_chat_int4_可视化.py", line 42, in <module>
    pipe = pipeline(model_path,
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/api.py", line 94, in pipeline
    return pipeline_class(model_path,
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/serve/vl_async_engine.py", line 21, in __init__
    super().__init__(model_path, **kwargs)
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/serve/async_engine.py", line 206, in __init__
    self._build_turbomind(model_path=model_path,
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/serve/async_engine.py", line 253, in _build_turbomind
    self.engine = tm.TurboMind.from_pretrained(
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 387, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 161, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/turbomind/turbomind.py", line 270, in _from_hf
    output_model = OUTPUT_MODELS.get(output_format)(
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/fp.py", line 26, in __init__
    super().__init__(input_model, cfg, to_file, out_dir)
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/base.py", line 156, in __init__
    self.cfg = self.get_config(cfg)
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/target_model/fp.py", line 38, in get_config
    w1, _, _ = bin.ffn(i)
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/source_model/qwen.py", line 62, in ffn
    return self._ffn(i, 'weight')
  File "/data/54T/envs/lmdeploy/lib/python3.8/site-packages/lmdeploy/turbomind/deploy/source_model/qwen.py", line 56, in _ffn
    tensor = self.params[f'transformer.h.{i}.mlp.{key}.{kind}']
KeyError: 'transformer.h.0.mlp.w2.weight'

Reproduction

python3 -m lmdeploy lite auto_awq /data/54T/luominghua/models/qwen-vl-chat-0625 --calib-samples 128 --search-scale True --batch-size 8 --calib-seqlen 2048 --w-bits 4 --w-group-size 128 --work-dir /data/54T/luominghua/models/qwen-vl-chat-0625-int4-search_batch

Environment

sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7,8,9: NVIDIA RTX A6000
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 10.1, V10.1.24
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.0+cu121
LMDeploy: 0.4.2+
transformers: 4.41.2
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.7.4
triton: 2.1.0

Error traceback

No response

AllentDan commented 2 days ago

Cannot reproduce. Please make sure the inference code and the model path you pass in are correct.

AIFFFENG commented 2 days ago

Cannot reproduce. Please make sure the inference code and the model path you pass in are correct.

The model path is correct. The inference code is below; both quantized models were run with this same code:

    import os
    import time

    from lmdeploy import pipeline, ChatTemplateConfig, GenerationConfig, TurbomindEngineConfig
    from lmdeploy.vl import load_image

    engine_config = TurbomindEngineConfig(model_format='awq')
    pipe = pipeline(model_path,
                    chat_template_config=ChatTemplateConfig(model_name='qwen-7b'))  # , backend_config=engine_config

    gen_config = GenerationConfig(top_p=1, top_k=1, temperature=0.01,
                                  max_new_tokens=1024, random_seed=None)

    begin = time.time()
    for i in range(1):
        for name in os.listdir(image_dir)[:20]:
            image_path = os.path.join(image_dir, name)
            image = load_image(image_path)
            response = pipe((instruct_question, image), gen_config=gen_config, backend_config=engine_config)
            text = response.text
AllentDan commented 2 days ago

Cannot reproduce. Also, in your code backend_config is passed to pipe's __call__, where it has no effect; it has to be passed when you call the pipeline function.

AllentDan commented 2 days ago

This error looks like the awq model was started as a plain fp16 model, though I'm not sure how that happened. Please pass backend_config into pipeline manually.
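For reference, a minimal sketch of what that would look like, based on the snippet above; image_dir, image_name, and instruct_question are placeholders carried over from it, and the model path is the --work-dir from the reproduction command:

    import os

    from lmdeploy import pipeline, ChatTemplateConfig, GenerationConfig, TurbomindEngineConfig
    from lmdeploy.vl import load_image

    # Work-dir produced by auto_awq in the reproduction command above.
    model_path = '/data/54T/luominghua/models/qwen-vl-chat-0625-int4-search_batch'

    # model_format='awq' tells TurboMind the weights are AWQ-quantized;
    # it must be given to pipeline(), not to the pipe(...) call.
    engine_config = TurbomindEngineConfig(model_format='awq')
    pipe = pipeline(model_path,
                    backend_config=engine_config,
                    chat_template_config=ChatTemplateConfig(model_name='qwen-7b'))

    gen_config = GenerationConfig(top_p=1, top_k=1, temperature=0.01,
                                  max_new_tokens=1024, random_seed=None)

    image = load_image(os.path.join(image_dir, image_name))  # placeholders as in the snippet above
    response = pipe((instruct_question, image), gen_config=gen_config)
    print(response.text)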

AIFFFENG commented 2 days ago

This error looks like the awq model was started as a plain fp16 model, though I'm not sure how that happened. Please pass backend_config into pipeline manually.

Sorry, the code I pasted just now was for plain (non-quantized) inference. For quantized inference, backend_config was indeed passed into pipeline. Let me double-check. Thanks.

AIFFFENG commented 2 days ago

This error looks like the awq model was started as a plain fp16 model, though I'm not sure how that happened. Please pass backend_config into pipeline manually.

I tried it and it works now. The earlier problem was probably caused by insufficient GPU memory.
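If GPU memory really was the culprit, one knob that may help (my assumption for this case, not something verified here) is TurbomindEngineConfig's cache_max_entry_count, which sets how much of the free GPU memory TurboMind reserves for the k/v cache:

    from lmdeploy import pipeline, TurbomindEngineConfig

    # Reserve a smaller fraction of free GPU memory for the k/v cache,
    # leaving more headroom for weights and activations.
    engine_config = TurbomindEngineConfig(model_format='awq', cache_max_entry_count=0.4)
    pipe = pipeline(model_path, backend_config=engine_config)  # model_path as above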