InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Error when trying to load quantized llava-v1.6-34b #1418

Open zhaohm14 opened 6 months ago

zhaohm14 commented 6 months ago

Here's what I've done:

  1. Quantized llava-v1.6-34b with the following code:

    from llava.model.builder import load_pretrained_model
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path='/root/ssd/llava-v1.6-34b',
        model_base=None,
        model_name='llava-v1.6-34b',
        load_8bit=False,
        load_4bit=True,
        device_map='auto',
        device='cuda',
        use_flash_attn=False,
    )
    tokenizer.save_pretrained('/root/ssd/llava-v1.6-34b-int4')
    model.save_pretrained('/root/ssd/llava-v1.6-34b-int4')
  2. Modified config.json line 86: "model_type": "llava_llama" -> "llava".
  3. Successfully loaded the quantized model using both:

    from llava.model import LlavaLlamaForCausalLM
    model = LlavaLlamaForCausalLM.from_pretrained('/root/ssd/llava-v1.6-34b-int4')

    and

    from llava.model.builder import load_pretrained_model
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path='/root/ssd/llava-v1.6-34b-int4',
        model_base=None,
        model_name='llava-v1.6-34b',
        load_8bit=False,
        load_4bit=True,
        device_map='auto',
        device='cuda',
        use_flash_attn=False,
    )
  4. Successfully loaded the vanilla model with:

    from lmdeploy import pipeline, TurbomindEngineConfig
    pipe = pipeline(
        model_name='liuhaotian/llava-v1.6-34b',
        model_path='/root/ssd/llava-v1.6-34b',
        backend_config=TurbomindEngineConfig(
            tp=4,
            model_format='hf',
            session_len=8192,
            cache_max_entry_count=0.1,
        ),
        log_level='INFO'
    )

    However, when attempting to load the quantized model as follows, I encounter an error:

    pipe = pipeline(
        model_name='liuhaotian/llava-v1.6-34b',
        model_path='/root/ssd/llava-v1.6-34b-int4',  # -int4 here
        backend_config=TurbomindEngineConfig(
            tp=4,
            model_format='hf',
            session_len=8192,
            cache_max_entry_count=0.1,
        ),
        log_level='INFO'
    )

    Here's the error message:

    2024-04-10 08:09:02,731 - lmdeploy - INFO - Using turbomind engine
    2024-04-10 08:09:02,731 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name=None, model_format='hf', tp=4, session_len=8192, max_batch_size=128, cache_max_entry_count=0.1, cache_block_seq_len=64, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192)
    2024-04-10 08:09:02,731 - lmdeploy - INFO - input chat_template_config=None
    2024-04-10 08:09:02,731 - lmdeploy - WARNING - Could not find liuhaotian/llava-v1.6-34b-int4 in registered models. Register liuhaotian/llava-v1.6-34b-int4 using the BaseChatTemplate.
    2024-04-10 08:09:02,731 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='liuhaotian/llava-v1.6-34b-int4', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
    2024-04-10 08:09:02,781 - lmdeploy - WARNING - model_source: hf_model
    2024-04-10 08:09:05,017 - lmdeploy - WARNING - model_config:
    [llama]
    model_name = base
    tensor_para_size = 4
    head_num = 56
    kv_head_num = 8
    vocab_size = 64000
    num_layer = 60
    inter_size = 73400320
    norm_eps = 1e-05
    attn_bias = 0
    start_id = 64000
    end_id = 7
    session_len = 8192
    weight_type = fp16
    rotary_embedding = 128
    rope_theta = 5000000.0
    size_per_head = 128
    group_size = 0
    max_batch_size = 128
    max_context_token_num = 1
    step_length = 1
    cache_max_entry_count = 0.1
    cache_block_seq_len = 64
    cache_chunk_size = -1
    num_tokens_per_iter = 8192
    max_prefill_iters = 1
    extra_tokens_per_iter = 0
    use_context_fmha = 1
    quant_policy = 0
    max_position_embeddings = 4096
    rope_scaling_factor = 0.0
    use_dynamic_ntk = 0
    use_logn_attn = 0
    [TM][INFO] Set logger level by INFO
    [TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 8192.
    Exception in thread Thread-6 (_create_weight_func):
    Traceback (most recent call last):
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
    model_comm.create_shared_weights(device_id, rank)
    RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32 
    Exception in thread Thread-7 (_create_weight_func):
    Traceback (most recent call last):
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
    model_comm.create_shared_weights(device_id, rank)
    RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32 
    Exception in thread Thread-4 (_create_weight_func):
    Traceback (most recent call last):
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
    model_comm.create_shared_weights(device_id, rank)
    RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32 
    Exception in thread Thread-5 (_create_weight_func):
    Traceback (most recent call last):
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 196, in _create_weight_func
    model_comm.create_shared_weights(device_id, rank)
    RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32 
    Exception in thread Thread-8 (_get_params) (Thread-9, Thread-10 and Thread-11 fail identically; their tracebacks are interleaved in the raw output):
    Traceback (most recent call last):
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
    File "/root/miniconda3/envs/LMdeploy/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 226, in _get_params
    out = model_comm.get_params(device_id, rank)
    RuntimeError: [TM][ERROR]  Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:384

    Despite the error, GPU memory usage appears to be low (286MiB / 22GiB). Here is my pip list:

    Package                   Version
    ------------------------- -----------
    accelerate                0.21.0
    addict                    2.4.0
    aiofiles                  23.2.1
    altair                    5.3.0
    annotated-types           0.6.0
    anyio                     4.3.0
    attrs                     23.2.0
    bitsandbytes              0.43.0
    certifi                   2024.2.2
    charset-normalizer        3.3.2
    click                     8.1.7
    contourpy                 1.2.1
    cycler                    0.12.1
    einops                    0.6.1
    einops-exts               0.0.4
    exceptiongroup            1.2.0
    fastapi                   0.110.1
    ffmpy                     0.3.2
    filelock                  3.13.3
    fire                      0.6.0
    fonttools                 4.50.0
    fsspec                    2024.3.1
    gradio                    4.16.0
    gradio_client             0.8.1
    h11                       0.14.0
    httpcore                  0.17.3
    httpx                     0.24.0
    huggingface-hub           0.22.2
    idna                      3.6
    importlib_metadata        7.1.0
    importlib_resources       6.4.0
    Jinja2                    3.1.3
    joblib                    1.3.2
    jsonschema                4.21.1
    jsonschema-specifications 2023.12.1
    kiwisolver                1.4.5
    llava                     1.2.2.post1
    lmdeploy                  0.3.0
    markdown-it-py            3.0.0
    markdown2                 2.4.13
    MarkupSafe                2.1.5
    matplotlib                3.8.4
    mdurl                     0.1.2
    mmengine-lite             0.10.3
    mpmath                    1.3.0
    networkx                  3.2.1
    numpy                     1.26.4
    nvidia-cublas-cu12        12.1.3.1
    nvidia-cuda-cupti-cu12    12.1.105
    nvidia-cuda-nvrtc-cu12    12.1.105
    nvidia-cuda-runtime-cu12  12.1.105
    nvidia-cudnn-cu12         8.9.2.26
    nvidia-cufft-cu12         11.0.2.54
    nvidia-curand-cu12        10.3.2.106
    nvidia-cusolver-cu12      11.4.5.107
    nvidia-cusparse-cu12      12.1.0.106
    nvidia-nccl-cu12          2.18.1
    nvidia-nvjitlink-cu12     12.4.99
    nvidia-nvtx-cu12          12.1.105
    orjson                    3.10.0
    packaging                 24.0
    pandas                    2.2.1
    peft                      0.9.0
    pillow                    10.3.0
    pip                       23.3.1
    platformdirs              4.2.0
    protobuf                  5.26.1
    psutil                    5.9.8
    pydantic                  2.6.4
    pydantic_core             2.16.3
    pydub                     0.25.1
    Pygments                  2.17.2
    pynvml                    11.5.0
    pyparsing                 3.1.2
    python-dateutil           2.9.0.post0
    python-multipart          0.0.9
    pytz                      2024.1
    PyYAML                    6.0.1
    referencing               0.34.0
    regex                     2023.12.25
    requests                  2.31.0
    rich                      13.7.1
    rpds-py                   0.18.0
    ruff                      0.3.5
    safetensors               0.4.2
    scikit-learn              1.2.2
    scipy                     1.13.0
    semantic-version          2.10.0
    sentencepiece             0.1.99
    setuptools                68.2.2
    shellingham               1.5.4
    shortuuid                 1.0.13
    six                       1.16.0
    sniffio                   1.3.1
    starlette                 0.37.2
    svgwrite                  1.4.3
    sympy                     1.12
    termcolor                 2.4.0
    threadpoolctl             3.4.0
    tiktoken                  0.6.0
    timm                      0.6.13
    tokenizers                0.15.1
    tomli                     2.0.1
    tomlkit                   0.12.0
    toolz                     0.12.1
    torch                     2.1.2
    torchvision               0.16.2
    tqdm                      4.66.2
    transformers              4.37.2
    triton                    2.1.0
    typer                     0.12.0
    typer-cli                 0.12.0
    typer-slim                0.12.0
    typing_extensions         4.10.0
    tzdata                    2024.1
    urllib3                   2.2.1
    uvicorn                   0.29.0
    wavedrom                  2.0.3.post3
    websockets                11.0.3
    wheel                     0.41.2
    yapf                      0.40.2
    zipp                      3.18.1

    Thanks a lot for your help!

irexyc commented 6 months ago

Currently, the VL models only support the turbomind backend, which only accepts the AWQ quantization format. Since llava shares the same architecture as llama, you can use our quantization tools to quantize the model.

Here is the guide https://github.com/InternLM/lmdeploy/blob/main/docs/en/quantization/w4a16.md

To quantize the llava model, you have to modify the code according to this diff: https://github.com/InternLM/lmdeploy/commit/0b40aecc5877cd97a0e0622f9cb3fa57298b1d83

By the way, load_4bit uses bitsandbytes, which applies a dynamic quantization strategy. It is not very efficient; in my previous tests it was slower than the fp16/bf16 format.
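
For reference, a minimal sketch of that workflow (not taken from this thread verbatim): it assumes the AWQ weights are written to /root/ssd/llava-v1.6-34b-awq (the --work-dir used later in this thread) and that the pipeline arguments shown above are reused with model_format='awq'.

    # Quantize first with the CLI from the w4a16 guide, e.g.
    #   lmdeploy lite auto_awq /root/ssd/llava-v1.6-34b --work-dir /root/ssd/llava-v1.6-34b-awq
    # then load the AWQ weights with the turbomind backend:
    from lmdeploy import pipeline, TurbomindEngineConfig

    pipe = pipeline(
        model_name='liuhaotian/llava-v1.6-34b',
        model_path='/root/ssd/llava-v1.6-34b-awq',  # assumed output dir of auto_awq
        backend_config=TurbomindEngineConfig(
            tp=4,
            model_format='awq',  # AWQ weights instead of 'hf'
            session_len=8192,
            cache_max_entry_count=0.1,
        ),
        log_level='INFO',
    )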

zhaohm14 commented 6 months ago

Currently, the VL models only support the turbomind backend, which only accepts the AWQ quantization format. Since llava shares the same architecture as llama, you can use our quantization tools to quantize the model.

Here is the guide https://github.com/InternLM/lmdeploy/blob/main/docs/en/quantization/w4a16.md

To quantize the llava model, you have to modify the code according to this diff: 0b40aec

By the way, load_4bit uses bitsandbytes, which applies a dynamic quantization strategy. It is not very efficient; in my previous tests it was slower than the fp16/bf16 format.

Thank you very much! The quantization script runs now, but it throws the following assertion error:

(lmdeploy) root@ubuntu:~/8h/LLaVA/models# CUDA_VISIBLE_DEVICES=6 lmdeploy lite auto_awq /root/ssd/llava-v1.6-34b --w-group-size 32 --work-dir /root/ssd/llava-v1.6-34b-awq
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 15/15 [00:14<00:00,  1.07it/s]
Move model.embed_tokens to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.layers.24 to CPU.
Move model.layers.25 to CPU.
Move model.layers.26 to CPU.
Move model.layers.27 to CPU.
Move model.layers.28 to CPU.
Move model.layers.29 to CPU.
Move model.layers.30 to CPU.
Move model.layers.31 to CPU.
Move model.layers.32 to CPU.
Move model.layers.33 to CPU.
Move model.layers.34 to CPU.
Move model.layers.35 to CPU.
Move model.layers.36 to CPU.
Move model.layers.37 to CPU.
Move model.layers.38 to CPU.
Move model.layers.39 to CPU.
Move model.layers.40 to CPU.
Move model.layers.41 to CPU.
Move model.layers.42 to CPU.
Move model.layers.43 to CPU.
Move model.layers.44 to CPU.
Move model.layers.45 to CPU.
Move model.layers.46 to CPU.
Move model.layers.47 to CPU.
Move model.layers.48 to CPU.
Move model.layers.49 to CPU.
Move model.layers.50 to CPU.
Move model.layers.51 to CPU.
Move model.layers.52 to CPU.
Move model.layers.53 to CPU.
Move model.layers.54 to CPU.
Move model.layers.55 to CPU.
Move model.layers.56 to CPU.
Move model.layers.57 to CPU.
Move model.layers.58 to CPU.
Move model.layers.59 to CPU.
Move model.norm to GPU.
Move model.vision_tower to GPU.
Move model.mm_projector to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Token indices sequence length is longer than the specified maximum sequence length for this model (1140896 > 4096). Running this sequence through the model will result in indexing errors
model.layers.0, samples: 128, max gpu memory: 13.07 GB
model.layers.1, samples: 128, max gpu memory: 16.57 GB
model.layers.2, samples: 128, max gpu memory: 16.57 GB
model.layers.3, samples: 128, max gpu memory: 16.57 GB
model.layers.4, samples: 128, max gpu memory: 16.57 GB
model.layers.5, samples: 128, max gpu memory: 16.57 GB
model.layers.6, samples: 128, max gpu memory: 16.57 GB
model.layers.7, samples: 128, max gpu memory: 16.57 GB
model.layers.8, samples: 128, max gpu memory: 16.57 GB
model.layers.9, samples: 128, max gpu memory: 16.57 GB
model.layers.10, samples: 128, max gpu memory: 16.57 GB
model.layers.11, samples: 128, max gpu memory: 16.57 GB
model.layers.12, samples: 128, max gpu memory: 16.57 GB
model.layers.13, samples: 128, max gpu memory: 16.57 GB
model.layers.14, samples: 128, max gpu memory: 16.57 GB
model.layers.15, samples: 128, max gpu memory: 16.57 GB
model.layers.16, samples: 128, max gpu memory: 16.57 GB
model.layers.17, samples: 128, max gpu memory: 16.57 GB
model.layers.18, samples: 128, max gpu memory: 16.57 GB
model.layers.19, samples: 128, max gpu memory: 16.57 GB
model.layers.20, samples: 128, max gpu memory: 16.57 GB
model.layers.21, samples: 128, max gpu memory: 16.57 GB
model.layers.22, samples: 128, max gpu memory: 16.57 GB
model.layers.23, samples: 128, max gpu memory: 16.57 GB
model.layers.24, samples: 128, max gpu memory: 16.57 GB
model.layers.25, samples: 128, max gpu memory: 16.57 GB
model.layers.26, samples: 128, max gpu memory: 16.57 GB
model.layers.27, samples: 128, max gpu memory: 16.57 GB
model.layers.28, samples: 128, max gpu memory: 16.57 GB
model.layers.29, samples: 128, max gpu memory: 16.57 GB
model.layers.30, samples: 128, max gpu memory: 16.57 GB
model.layers.31, samples: 128, max gpu memory: 16.57 GB
model.layers.32, samples: 128, max gpu memory: 16.57 GB
model.layers.33, samples: 128, max gpu memory: 16.57 GB
model.layers.34, samples: 128, max gpu memory: 16.57 GB
model.layers.35, samples: 128, max gpu memory: 16.57 GB
model.layers.36, samples: 128, max gpu memory: 16.57 GB
model.layers.37, samples: 128, max gpu memory: 16.57 GB
model.layers.38, samples: 128, max gpu memory: 16.57 GB
model.layers.39, samples: 128, max gpu memory: 16.57 GB
model.layers.40, samples: 128, max gpu memory: 16.57 GB
model.layers.41, samples: 128, max gpu memory: 16.57 GB
model.layers.42, samples: 128, max gpu memory: 16.57 GB
model.layers.43, samples: 128, max gpu memory: 16.57 GB
model.layers.44, samples: 128, max gpu memory: 16.57 GB
model.layers.45, samples: 128, max gpu memory: 16.57 GB
model.layers.46, samples: 128, max gpu memory: 16.57 GB
model.layers.47, samples: 128, max gpu memory: 16.57 GB
model.layers.48, samples: 128, max gpu memory: 16.57 GB
model.layers.49, samples: 128, max gpu memory: 16.57 GB
model.layers.50, samples: 128, max gpu memory: 16.57 GB
model.layers.51, samples: 128, max gpu memory: 16.57 GB
model.layers.52, samples: 128, max gpu memory: 16.57 GB
model.layers.53, samples: 128, max gpu memory: 16.57 GB
model.layers.54, samples: 128, max gpu memory: 16.57 GB
model.layers.55, samples: 128, max gpu memory: 16.57 GB
model.layers.56, samples: 128, max gpu memory: 16.57 GB
model.layers.57, samples: 128, max gpu memory: 16.57 GB
model.layers.58, samples: 128, max gpu memory: 16.57 GB
model.layers.59, samples: 128, max gpu memory: 16.57 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
Traceback (most recent call last):
  File "/root/miniconda3/envs/lmdeploy/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 26, in run
    args.run(args)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
    auto_awq(**kwargs)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 69, in auto_awq
    smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size, device)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/quantization/awq.py", line 233, in smooth_layers
    smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/quantization/awq.py", line 109, in smooth_ln_fcs
    assert torch.isnan(p).sum() == 0
AssertionError

This looks like it might be related to this issue: https://github.com/InternLM/lmdeploy/issues/243. Is the warning "Token indices sequence length is longer than the specified maximum sequence length for this model (1140896 > 4096). Running this sequence through the model will result in indexing errors" relevant here? Maybe I should switch to a different calibration dataset?
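
(For reference, a hedged sketch of how the calibration settings could be varied from the command line, assuming this lmdeploy version exposes the --calib-dataset, --calib-samples and --calib-seqlen options described in the w4a16 guide; the dataset name wikitext2 is only an example choice.)

    CUDA_VISIBLE_DEVICES=6 lmdeploy lite auto_awq /root/ssd/llava-v1.6-34b \
        --calib-dataset wikitext2 \
        --calib-samples 128 \
        --calib-seqlen 2048 \
        --w-group-size 128 \
        --work-dir /root/ssd/llava-v1.6-34b-awq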

irexyc commented 6 months ago

Don't change the --w-group-size parameter; turbomind currently only supports 128. I tried it with 128 yesterday and it worked.

zhaohm14 commented 6 months ago

Don't change the --w-group-size parameter; turbomind currently only supports 128. I tried it with 128 yesterday and it worked.

I tried 128, 64, and 32 locally, and all of them threw the exception at the same place. Could you please share your successfully quantized model? Thank you!

irexyc commented 6 months ago

Yesterday I tried llava-v1.5-7b and llava-v1.6-vicuna-7b.

I just tried llava-v1.6-34b and it reports the same error; it may be the same problem as the issue you mentioned. @pppppM is there currently any way to solve this?

pppppM commented 6 months ago

The model quantization broke down: the quantization calibration produced NaN values in the parameters. The calibration strategy may need to be adjusted.
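
(A minimal, hypothetical diagnostic sketch, using the checkpoint path from earlier in this thread: scan the source shards for non-finite values before quantizing, to tell bad source weights apart from NaNs introduced during calibration/smoothing.)

    import glob
    import os

    import torch
    from safetensors.torch import load_file

    ckpt_dir = '/root/ssd/llava-v1.6-34b'  # fp16 checkpoint used in this thread
    shards = sorted(glob.glob(os.path.join(ckpt_dir, '*.safetensors'))) or \
             sorted(glob.glob(os.path.join(ckpt_dir, 'pytorch_model*.bin')))
    for shard in shards:
        # safetensors shards load as a dict of tensors; .bin shards are torch state dicts
        tensors = load_file(shard) if shard.endswith('.safetensors') \
            else torch.load(shard, map_location='cpu')
        for name, tensor in tensors.items():
            bad = (~torch.isfinite(tensor.float())).sum().item()
            if bad:
                print(f'{os.path.basename(shard)} / {name}: {bad} non-finite values')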

zhyncs commented 6 months ago

The model quantization broke down: the quantization calibration produced NaN values in the parameters. The calibration strategy may need to be adjusted.

ref https://github.com/InternLM/lmdeploy/issues/243#issuecomment-1770503299

SurenderSardana99 commented 5 months ago

While quantizing a model using lmdeploy, I am also getting this issue. Command: lmdeploy lite auto_awq ./llama2-chat-7b-w4 --work-dir ./llama2-chat-7b-4bit

Traceback (most recent call last):
  File "/home/userdata/.local/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/home/userdata/.local/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/home/userdata/.local/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
    auto_awq(**kwargs)
  File "/home/userdata/.local/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 68, in auto_awq
    smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size, device)
  File "/home/userdata/.local/lib/python3.10/site-packages/lmdeploy/lite/quantization/awq.py", line 242, in smooth_layers
    smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
  File "/home/userdata/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/userdata/.local/lib/python3.10/site-packages/lmdeploy/lite/quantization/awq.py", line 118, in smooth_ln_fcs
    assert torch.isnan(p).sum() == 0
AssertionError

lvhan028 commented 5 months ago

is "./llama2-chat-7b-w4 " already a quantized model?