NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Question] Quantization methods supported by the multi-modal (LLaVA) pipeline #1101

Open haohuanw opened 8 months ago

haohuanw commented 8 months ago

Hi!

I'd like to understand which quantization methods the current multi-modal (decoder-only) pipeline supports.

From https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal#llava it seems that only the weight-only method is supported. Is SmoothQuant also supported? https://github.com/NVIDIA/TensorRT-LLM/blob/0f041b7b57d118037f943fd6107db846ed147b91/examples/llama/convert_checkpoint.py#L1400 seems to indicate that LLaVA can do SmoothQuant as well.

If that's the case, is the AWQ method through AMMO also supported for LLaVA?

amukkara commented 8 months ago

Hi @haohuanw, yes, LLaVA supports SmoothQuant. You can follow the commands provided in examples/llama/README.md for enabling SmoothQuant, adding the appropriate multimodal arguments like --max_multimodal_len to trtllm-build.

Note that the visual component does not need quantization, so you can use the same visual engine with SmoothQuant. Let us know if you face any issues.
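
A minimal sketch of that flow, assuming the LLaMA SmoothQuant recipe from examples/llama/README.md carries over unchanged (paths and sizes are placeholders; exact flags can vary by release):

# SmoothQuant calibration + checkpoint conversion, per examples/llama/README.md
python examples/llama/convert_checkpoint.py \
    --model_dir <llava_hf_dir> \
    --output_dir ./tllm_ckpt_sq \
    --dtype float16 \
    --smoothquant 0.5 --per_token --per_channel

# Engine build; --max_multimodal_len reserves room for the visual tokens
# (e.g. 576 tokens per image for LLaVA-1.5)
trtllm-build \
    --checkpoint_dir ./tllm_ckpt_sq \
    --output_dir ./engines/llava_sq \
    --gemm_plugin float16 \
    --max_multimodal_len 576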

haohuanw commented 8 months ago

Thanks! I will give it a try. Do you know if anyone has tried the AWQ INT4 method with LLaVA?

amukkara commented 8 months ago

INT4 AWQ quantization instructions for LLaMA should work for LLaVA as well.

One minor change is needed in examples/quantization/quantize.py. Replace this line with:

from transformers import LlavaForConditionalGeneration

hf_llava = LlavaForConditionalGeneration.from_pretrained(
    ckpt_path, device_map="auto", **model_kwargs, trust_remote_code=True)
model = hf_llava.language_model  # quantize only the language model

We will add a note about quantization methods in examples/multimodal/README.md in the next release.
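
For reference, a sketch of the corresponding quantize.py invocation, assuming the LLaMA INT4 AWQ flags apply to LLaVA unchanged (the qformat name and defaults may differ across releases):

python examples/quantization/quantize.py \
    --model_dir <llava_hf_dir> \
    --dtype float16 \
    --qformat int4_awq \
    --output_dir ./tllm_ckpt_awq \
    --calib_size 512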

haohuanw commented 8 months ago

Hi there, I want to follow up a little more here. We have a custom-trained multi-modality model where we see large regressions if we quantize directly without injecting the multi-modality embeddings.

Is there a way we could inject the multi-modal embeddings during quantization? Basically, I would like to know if there is a way to inject embeddings after the first projection layer here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/quantization/quantize_by_ammo.py#L226

cjluo-omniml commented 8 months ago

Hi @haohuanw, if your question "Is there a way that we could inject multi-modal embeddings during quantization?" is about whether you can add additional inputs (like the multi-modal embeddings) to the model, I think that should be possible. The quantization process does not change the input and output signature.

haohuanw commented 8 months ago

@cjluo-omniml long time no see! :D

So basically the multimodal model we have looks like:

embedding = encoder(<inputs>)                          # multi-modal encoder output
embedding_mask = input_ids >= vocab_size               # positions holding multi-modal tokens
text_embedding = llm.get_input_embeddings()(input_ids * ~embedding_mask)
projection = replace_embedding(text_embedding, embedding, embedding_mask)
logits = llm(inputs_embeds=projection)

I am wondering if it is possible to have the model here be the model after the input-embedding stage, so that it can be calibrated with embeddings.

haohuanw commented 7 months ago

@cjluo-omniml Any thoughts or code pointers?

cjluo-omniml commented 7 months ago

I think it is possible. Have you tried it? In this case, the model here is the llm in your code, and the input calibration data is the projection output.

The only challenge might be exporting this as a TRT-LLM checkpoint for deployment. But I think you should at least be able to use AMMO to finish the PTQ.
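
For illustration, a minimal sketch of that calibration setup with AMMO PTQ; llm, calib_dataloader, and build_projection (a hypothetical helper mirroring the pseudocode above) are assumptions, not code from quantize_by_ammo.py:

import torch
import ammo.torch.quantization as atq

def calibrate_loop():
    # Feed the LLM pre-computed multi-modal embeddings (inputs_embeds)
    # instead of input_ids, so activations are calibrated on inputs that
    # have already passed the input-embedding stage.
    for batch in calib_dataloader:
        projection = build_projection(batch)  # hypothetical: encoder + replace_embedding
        with torch.no_grad():
            llm(inputs_embeds=projection)

# Quantize only the language model; the visual encoder stays unquantized.
llm = atq.quantize(llm, atq.INT4_AWQ_CFG, forward_loop=calibrate_loop)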

haohuanw commented 7 months ago

@cjluo-omniml It works for me. Basically, if I still initialize the model normally and call it with model(inputs_embeds=<embedding after vocab embedding layer>), everything works out of the box and I am getting reasonable accuracy.