casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License
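For context, the documented AutoAWQ workflow looks roughly like the sketch below (a minimal example assuming the awq package is installed; the model path and quantization settings are illustrative, following the project's basic-usage docs):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # example model, swap in your own
quant_path = "mistral-instruct-v0.2-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# load the fp16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# run AWQ calibration and quantize the weights to 4-bit
model.quantize(tokenizer, quant_config=quant_config)

# save the quantized weights and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)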

How is Llava quantized? #621

Open · Abhranta opened 1 month ago

Abhranta commented 1 month ago

In AutoAWQ, do we only quantize the LLM part of LLaVA, or do we also quantize the ViT? Can we add support for quantizing vision models like ViT or SigLIP?

sailfish009 commented 1 week ago

@Abhranta Hi, there is AutoGPTQ, which can quantize a LLaVA checkpoint after a small config.json edit:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

"""
Download https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-lightning-preview to a local directory.
Make the following edits to its config.json:
LlavaLlamaForCausalLM -> LlamaForCausalLM
"model_type": "llava" -> "llama"
"""
pretrained_model_dir = "./checkpoints/llava-llama-2-13b-chat-lightning-preview"

quantized_model_dir = "llava-llama-2-13b-chat-lightning-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)

# load the un-quantized model; by default it is loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize the model; examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)
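
A natural follow-up (a sketch adapted from the AutoGPTQ basic-usage example; the prompt strings are just illustrations) is to reload the quantized checkpoint and run generation with it:

# load the quantized model onto the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or use the TextGenerationPipeline imported above
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])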

Abhranta commented 1 week ago

Does this quantize only the LLM, or the ViT too?

pratyush0599 commented 1 week ago

Hi @sailfish009, is there no native support for LLaVA-based models? The solution you suggested seems very hacky :( I was also wondering whether the quantization happens to the vision encoder too?

sailfish009 commented 5 days ago

@pratyush0599 @Abhranta Yes, it quantizes the language model only. For the vision model:

Hope this helps!
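
If you want to double-check that only the language model was quantized, one rough way (a hypothetical snippet, not from the AutoGPTQ docs) is to reload the checkpoint and list which submodules were replaced by quantized linear layers:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized("llava-llama-2-13b-chat-lightning-4bit-128g", device="cpu")

# modules whose class name contains "QuantLinear" were quantized; with the
# config.json edit above only the LlamaForCausalLM (language model) is loaded,
# so no vision-tower layers show up here
quantized = [name for name, m in model.model.named_modules()
             if "QuantLinear" in type(m).__name__]
print(f"{len(quantized)} quantized linear layers, e.g. {quantized[:3]}")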