InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Error when trying to load awq llava 1.5 13b model #1511

Closed: isaac-vidas closed this issue 4 months ago

isaac-vidas commented 4 months ago

Checklist

Describe the bug

Hi 👋, I quantized the llava 1.5 13b model with the following command:

lmdeploy lite auto_awq models/llava-v1.5-13b/ \
  --w-group-size 128 \
  --work-dir awq/llava-v1.5-13b-4bit2

When trying to run the following code:

from lmdeploy import ChatTemplateConfig, TurbomindEngineConfig, pipeline

pipe = pipeline(
    'awq/llava-v1.5-13b-4bit2',
    chat_template_config=ChatTemplateConfig(model_name='vicuna'),
    backend_config=TurbomindEngineConfig(
        model_format='awq'
    )
)
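
(For context, the pipe would then be used in the usual lmdeploy VLM way; the sketch below is illustrative only, not the actual main.py, and the image path is a placeholder.)

from lmdeploy.vl import load_image

# Illustrative usage sketch; 'tiger.jpg' is a placeholder image path.
image = load_image('tiger.jpg')
response = pipe(('describe this image', image))
print(response)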

I encountered the following error:

$ python main.py 
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 3096.57it/s]
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Traceback (most recent call last):
  File "/home/gcpuser/sky_workdir/main.py", line 91, in <module>
    main()
  File "/home/gcpuser/sky_workdir/main.py", line 57, in main
    pipe = pipeline(
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/api.py", line 94, in pipeline
    return pipeline_class(model_path,
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 16, in __init__
    self.vl_encoder = ImageEncoder(model_path)
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 68, in __init__
    self.model = load_vl_model(model_path)
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/vl/model/builder.py", line 36, in load_vl_model
    return LlavaVisionModel(model_path)
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/vl/model/llava.py", line 37, in __init__
    self.build_model()
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/vl/model/llava.py", line 54, in build_model
    model = LlavaLlamaForCausalLM.from_pretrained(self.model_path)
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3740, in _load_pretrained_model
    set_module_tensor_to_device(model, key, "cpu", value)
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 399, in set_module_tensor_to_device
    new_value = value.to(device)
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
NotImplementedError: Cannot copy out of meta tensor; no data!
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

Additional notes:

Did anyone else encounter this?

Reproduction

Described in the previous section

Environment

$ lmdeploy check_env
sys.platform: linux
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Debian 10.2.1-6) 10.2.1 20210110
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.4  (built against CUDA 12.2)
    - Built with CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

LMDeploy: 0.4.0+
transformers: 4.38.2
gradio: 3.50.2
fastapi: 0.110.2
pydantic: 2.7.1
triton: 2.2.0

Error traceback

Described above in bug details.
AllentDan commented 4 months ago

See https://lmdeploy.readthedocs.io/en/latest/supported_models/supported_models.html. Currently, settings other than fp16 are not guaranteed.

irexyc commented 4 months ago

Could you check whether it works if you replace this line as below?

# with torch.device('meta'), warnings.catch_warnings():
from accelerate import init_empty_weights
with init_empty_weights(), warnings.catch_warnings():

BTW, I don't think you need to install autoawq
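
For context: init_empty_weights() from accelerate only registers module parameters and buffers on the meta device, whereas the global torch.device('meta') context affects every tensor created without an explicit device. A minimal, self-contained illustration (not lmdeploy code):

import torch.nn as nn
from accelerate import init_empty_weights

# Parameters created inside the context live on the meta device (no real
# storage), so no weight data is allocated or copied at construction time.
with init_empty_weights():
    layer = nn.Linear(4, 4)
print(layer.weight.device)  # meta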

isaac-vidas commented 4 months ago

Thanks @irexyc and @AllentDan!

I tried the change @irexyc suggested and it got past the previous error, but now it hits the following one:

$ python main.py 
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 3089.45it/s]
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.11it/s]
Some weights of the model checkpoint at /home/gcpuser/sky_workdir/awq/llava-v1.5-13b-4bit were not used when initializing LlavaLlamaForCausalLM: ['model.mm_projector.0.weight', 'model.mm_projector.2.weight']
- This IS expected if you are initializing LlavaLlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlavaLlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LlavaLlamaForCausalLM were not initialized from the model checkpoint at /home/gcpuser/sky_workdir/awq/llava-v1.5-13b-4bit and are newly initialized: ['model.mm_projector.0.qweight', 'model.mm_projector.0.qzeros', 'model.mm_projector.0.scales']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/home/gcpuser/sky_workdir/main.py", line 91, in <module>
    main()
  File "/home/gcpuser/sky_workdir/main.py", line 57, in main
    pipe = pipeline(
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/api.py", line 94, in pipeline
    return pipeline_class(model_path,
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 16, in __init__
    self.vl_encoder = ImageEncoder(model_path)
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 68, in __init__
    self.model = load_vl_model(model_path)
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/vl/model/builder.py", line 36, in load_vl_model
    return LlavaVisionModel(model_path)
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/vl/model/llava.py", line 37, in __init__
    self.build_model()
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/lmdeploy/vl/model/llava.py", line 69, in build_model
    model.to(self.device).eval().half()
  File "/opt/conda/envs/lmdeploy_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2561, in half
    raise ValueError(
ValueError: `.half()` is not supported for quantized model. Please use the model as it is, since the model has already been casted to the correct `dtype`.
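
Not something I ran, but for reference: transformers raises this when model.is_quantized is set, so a rough sketch of a local guard around the cast (names taken from the traceback; the thread's eventual fix below is different) would be:

# Sketch only: skip the fp16 cast when the checkpoint is already quantized.
model = model.to(self.device).eval()
if not getattr(model, 'is_quantized', False):
    model = model.half()
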
irexyc commented 4 months ago

Could you provide the config.json file?

isaac-vidas commented 4 months ago

This is the config for the quantized model:

{
  "_name_or_path": "models/llava-v1.5-13b/",
  "architectures": [
    "LlavaLlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "fp16": true,
  "freeze_mm_mlp_adapter": false,
  "freeze_mm_vision_resampler": false,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "image_aspect_ratio": "pad",
  "initializer_range": 0.02,
  "intermediate_size": 13824,
  "max_length": 4096,
  "max_position_embeddings": 4096,
  "mm_hidden_size": 1024,
  "mm_patch_merge_type": "flat",
  "mm_projector_lr": null,
  "mm_projector_type": "mlp2x_gelu",
  "mm_resampler_type": null,
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "openai/clip-vit-large-patch14-336",
  "model_type": "llava",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "num_key_value_heads": 40,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "tokenizer_model_max_length": 4096,
  "tokenizer_padding_side": "right",
  "torch_dtype": "float16",
  "transformers_version": "4.40.1",
  "tune_mm_mlp_adapter": false,
  "tune_mm_vision_resampler": false,
  "unfreeze_mm_vision_tower": false,
  "use_cache": false,
  "use_mm_proj": true,
  "vocab_size": 32000
}
irexyc commented 4 months ago

You can check whether it works after removing quantization_config from the config:

  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
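
For reference, a minimal sketch of making that edit programmatically; the path below is the work dir from this thread and is only an assumption about the local layout:

import json

# Sketch: drop the AWQ quantization_config entry from the quantized model's
# config.json so the loader no longer treats the checkpoint as quantized.
config_path = 'awq/llava-v1.5-13b-4bit/config.json'  # adjust to the actual work dir
with open(config_path) as f:
    config = json.load(f)
config.pop('quantization_config', None)
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)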

@AllentDan will add better quantization support for VLM models next month.

isaac-vidas commented 4 months ago

It's working now 🎉 Thanks for all the help and this awesome library!

I'll keep this hack in place until additional support is added to the code.

Closing this for now.

zjysteven commented 3 months ago

@isaac-vidas Hello, may I ask which version of lmdeploy you were using when quantizing llava? I somehow ran into an error about transformers not having the llava model registered, described in #1601.