Open · zhaohm14 opened this issue 6 months ago
Currently, the VL models only support the turbomind backend, which only accepts the AWQ quantization format. Since llava has the same format as llama, you can use our quantization tools to quantize the model.
Here is the guide: https://github.com/InternLM/lmdeploy/blob/main/docs/en/quantization/w4a16.md
To quantize a llava model, you have to modify the code according to this diff: https://github.com/InternLM/lmdeploy/commit/0b40aecc5877cd97a0e0622f9cb3fa57298b1d83
By the way, load_4bit uses bitsandbytes, which applies a dynamic quantization strategy. It is not very efficient, and according to my earlier tests it is slower than the fp16/bf16 format.
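For reference, once the AWQ weights have been produced, loading them through the turbomind backend looks roughly like the sketch below. This is a minimal sketch rather than a confirmed recipe: it assumes an lmdeploy version whose `pipeline` / `TurbomindEngineConfig` API accepts `model_format='awq'` and already supports llava checkpoints, and the model path is just a placeholder for the `--work-dir` produced by `auto_awq`.

```python
# Minimal sketch (assumptions: this lmdeploy version exposes pipeline /
# TurbomindEngineConfig with model_format='awq' and supports llava models;
# the path below is a placeholder for the auto_awq --work-dir output).
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(model_format='awq')
pipe = pipeline('/path/to/llava-awq-work-dir', backend_config=engine_cfg)

# Text-only prompt; how images are passed depends on the VL API of your version.
print(pipe(['Describe AWQ quantization in one sentence.']))
```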
Thank you very much! The quantization program now runs, but it throws the following assertion error:
(lmdeploy) root@ubuntu:~/8h/LLaVA/models# CUDA_VISIBLE_DEVICES=6 lmdeploy lite auto_awq /root/ssd/llava-v1.6-34b --w-group-size 32 --work-dir /root/ssd/llava-v1.6-34b-awq
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 15/15 [00:14<00:00, 1.07it/s]
Move model.embed_tokens to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.layers.24 to CPU.
Move model.layers.25 to CPU.
Move model.layers.26 to CPU.
Move model.layers.27 to CPU.
Move model.layers.28 to CPU.
Move model.layers.29 to CPU.
Move model.layers.30 to CPU.
Move model.layers.31 to CPU.
Move model.layers.32 to CPU.
Move model.layers.33 to CPU.
Move model.layers.34 to CPU.
Move model.layers.35 to CPU.
Move model.layers.36 to CPU.
Move model.layers.37 to CPU.
Move model.layers.38 to CPU.
Move model.layers.39 to CPU.
Move model.layers.40 to CPU.
Move model.layers.41 to CPU.
Move model.layers.42 to CPU.
Move model.layers.43 to CPU.
Move model.layers.44 to CPU.
Move model.layers.45 to CPU.
Move model.layers.46 to CPU.
Move model.layers.47 to CPU.
Move model.layers.48 to CPU.
Move model.layers.49 to CPU.
Move model.layers.50 to CPU.
Move model.layers.51 to CPU.
Move model.layers.52 to CPU.
Move model.layers.53 to CPU.
Move model.layers.54 to CPU.
Move model.layers.55 to CPU.
Move model.layers.56 to CPU.
Move model.layers.57 to CPU.
Move model.layers.58 to CPU.
Move model.layers.59 to CPU.
Move model.norm to GPU.
Move model.vision_tower to GPU.
Move model.mm_projector to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
Token indices sequence length is longer than the specified maximum sequence length for this model (1140896 > 4096). Running this sequence through the model will result in indexing errors
model.layers.0, samples: 128, max gpu memory: 13.07 GB
model.layers.1, samples: 128, max gpu memory: 16.57 GB
model.layers.2, samples: 128, max gpu memory: 16.57 GB
model.layers.3, samples: 128, max gpu memory: 16.57 GB
model.layers.4, samples: 128, max gpu memory: 16.57 GB
model.layers.5, samples: 128, max gpu memory: 16.57 GB
model.layers.6, samples: 128, max gpu memory: 16.57 GB
model.layers.7, samples: 128, max gpu memory: 16.57 GB
model.layers.8, samples: 128, max gpu memory: 16.57 GB
model.layers.9, samples: 128, max gpu memory: 16.57 GB
model.layers.10, samples: 128, max gpu memory: 16.57 GB
model.layers.11, samples: 128, max gpu memory: 16.57 GB
model.layers.12, samples: 128, max gpu memory: 16.57 GB
model.layers.13, samples: 128, max gpu memory: 16.57 GB
model.layers.14, samples: 128, max gpu memory: 16.57 GB
model.layers.15, samples: 128, max gpu memory: 16.57 GB
model.layers.16, samples: 128, max gpu memory: 16.57 GB
model.layers.17, samples: 128, max gpu memory: 16.57 GB
model.layers.18, samples: 128, max gpu memory: 16.57 GB
model.layers.19, samples: 128, max gpu memory: 16.57 GB
model.layers.20, samples: 128, max gpu memory: 16.57 GB
model.layers.21, samples: 128, max gpu memory: 16.57 GB
model.layers.22, samples: 128, max gpu memory: 16.57 GB
model.layers.23, samples: 128, max gpu memory: 16.57 GB
model.layers.24, samples: 128, max gpu memory: 16.57 GB
model.layers.25, samples: 128, max gpu memory: 16.57 GB
model.layers.26, samples: 128, max gpu memory: 16.57 GB
model.layers.27, samples: 128, max gpu memory: 16.57 GB
model.layers.28, samples: 128, max gpu memory: 16.57 GB
model.layers.29, samples: 128, max gpu memory: 16.57 GB
model.layers.30, samples: 128, max gpu memory: 16.57 GB
model.layers.31, samples: 128, max gpu memory: 16.57 GB
model.layers.32, samples: 128, max gpu memory: 16.57 GB
model.layers.33, samples: 128, max gpu memory: 16.57 GB
model.layers.34, samples: 128, max gpu memory: 16.57 GB
model.layers.35, samples: 128, max gpu memory: 16.57 GB
model.layers.36, samples: 128, max gpu memory: 16.57 GB
model.layers.37, samples: 128, max gpu memory: 16.57 GB
model.layers.38, samples: 128, max gpu memory: 16.57 GB
model.layers.39, samples: 128, max gpu memory: 16.57 GB
model.layers.40, samples: 128, max gpu memory: 16.57 GB
model.layers.41, samples: 128, max gpu memory: 16.57 GB
model.layers.42, samples: 128, max gpu memory: 16.57 GB
model.layers.43, samples: 128, max gpu memory: 16.57 GB
model.layers.44, samples: 128, max gpu memory: 16.57 GB
model.layers.45, samples: 128, max gpu memory: 16.57 GB
model.layers.46, samples: 128, max gpu memory: 16.57 GB
model.layers.47, samples: 128, max gpu memory: 16.57 GB
model.layers.48, samples: 128, max gpu memory: 16.57 GB
model.layers.49, samples: 128, max gpu memory: 16.57 GB
model.layers.50, samples: 128, max gpu memory: 16.57 GB
model.layers.51, samples: 128, max gpu memory: 16.57 GB
model.layers.52, samples: 128, max gpu memory: 16.57 GB
model.layers.53, samples: 128, max gpu memory: 16.57 GB
model.layers.54, samples: 128, max gpu memory: 16.57 GB
model.layers.55, samples: 128, max gpu memory: 16.57 GB
model.layers.56, samples: 128, max gpu memory: 16.57 GB
model.layers.57, samples: 128, max gpu memory: 16.57 GB
model.layers.58, samples: 128, max gpu memory: 16.57 GB
model.layers.59, samples: 128, max gpu memory: 16.57 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
Traceback (most recent call last):
File "/root/miniconda3/envs/lmdeploy/bin/lmdeploy", line 8, in <module>
sys.exit(run())
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 26, in run
args.run(args)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
auto_awq(**kwargs)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 69, in auto_awq
smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size, device)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/quantization/awq.py", line 233, in smooth_layers
smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/quantization/awq.py", line 109, in smooth_ln_fcs
assert torch.isnan(p).sum() == 0
AssertionError
This looks like it might be related to this issue: https://github.com/InternLM/lmdeploy/issues/243. Is it connected to the line "Token indices sequence length is longer than the specified maximum sequence length for this model (1140896 > 4096). Running this sequence through the model will result in indexing errors"? Maybe I should switch to a different calibration dataset?
Don't change --w-group-size; turbomind currently only supports 128. I tried 128 yesterday and it worked.
I tried 128, 64, and 32 locally, and all of them threw the exception at the same place. Could you share your successfully quantized model? Thanks!
Yesterday I tried llava-v1.5-7b and llava-v1.6-vicuna-7b.
I just tried llava-v1.6-34b and it throws the same error; it may be the same problem as the issue you mentioned. @pppppM Is there currently any workaround for this?
The quantization has collapsed: calibration produced NaN values in the parameters. The calibration strategy may need to be adjusted.
ref https://github.com/InternLM/lmdeploy/issues/243#issuecomment-1770503299
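Since the assertion fires on `torch.isnan(p).sum() == 0` during weight smoothing, one way to narrow things down is to first check whether the source checkpoint itself already contains NaN/Inf weights, so the problem can be attributed to the calibration step rather than the checkpoint. The snippet below is a generic PyTorch/safetensors sketch, not an lmdeploy utility; it assumes the checkpoint is stored as .safetensors shards (adjust it to torch.load for .bin shards).

```python
# Generic sketch: scan the checkpoint shards of a model directory for NaN/Inf
# weights without instantiating the model. Assumes .safetensors shards; for
# pytorch_model-*.bin shards, load them with torch.load instead.
import glob
import torch
from safetensors.torch import load_file

bad = []
for shard in sorted(glob.glob('/root/ssd/llava-v1.6-34b/*.safetensors')):
    for name, tensor in load_file(shard).items():
        if torch.isnan(tensor).any() or torch.isinf(tensor).any():
            bad.append(name)

print('tensors containing NaN/Inf:', bad or 'none')
```

If the source checkpoint is clean, the NaNs are being introduced by the calibration/smoothing step itself, which matches the maintainer's diagnosis above.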
While quantizing a model using lmdeploy, I am also getting an issue with: lmdeploy lite auto_awq ./llama2-chat-7b-w4 --work-dir ./llama2-chat-7b-4bit
Traceback (most recent call last):
File "/home/userdata/.local/bin/lmdeploy", line 8, in
is "./llama2-chat-7b-w4 " already a quantized model?
Here's what I've done:
and
However, when attempting to load the quantized model as follows, I encounter an error:
Here's the error message:
Despite the error, the GPU memory usage appears to be low (286MiB / 22GiB). And this is my pip list:
Thanks a lot for your help!