haohuanw opened this issue 8 months ago
Hi @haohuanw,
Yes, LLaVA supports SmoothQuant. You can follow the commands provided in examples/llama/README.md for enabling SmoothQuant, adding the appropriate multimodal arguments such as --max_multimodal_len to trtllm-build.
Note that the visual component does not need quantization, so you can reuse the same visual engine alongside the SmoothQuant LLM engine. Let us know if you face any issues.
Thanks! I will give it a try. Do you know if anyone has tried the INT4 AWQ method with LLaVA?
The INT4 AWQ quantization instructions for LLaMA should work for LLaVA as well.
One minor change is needed in examples/quantization/quantize.py. Replace this line with:
# Load the full LLaVA checkpoint, then hand only its language model to the quantizer.
from transformers import LlavaForConditionalGeneration

hf_llava = LlavaForConditionalGeneration.from_pretrained(
    ckpt_path, device_map="auto", **model_kwargs, trust_remote_code=True)
model = hf_llava.language_model
We will add a note about quantization methods to examples/multimodal/README.md in the next release.
Hi there, I want to follow up a little more here. We have a custom-trained multimodal model where we see large regressions if we quantize directly without injecting the multimodal embeddings.
Is there a way that we could inject the multimodal embeddings during quantization? Basically, I would like to know if there is a way to inject the embeddings after the first projection layer here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/quantization/quantize_by_ammo.py#L226
Hi @haohuanw, if your question "Is there a way that we could inject the multimodal embeddings during quantization?" is about whether you can add additional inputs (like the multimodal embeddings) to the model, I think that should be possible. The quantization process does not change the model's input and output signature.
@cjluo-omniml long time no see! :D
So basically, the multimodal model we have looks like this:
embedding = encoder(<inputs>)                          # multimodal encoder output
embedding_mask = input_ids >= vocab                    # ids >= vocab size are multimodal placeholders
text_projection = llm.get_input_embeddings()(input_ids * ~embedding_mask)   # zero out multimodal positions
projection = replace_embedding(text_projection, embedding, embedding_mask)  # splice encoder embeddings in
logits = llm(projection)
I am wondering if it is possible to just have model here be the model after the input embedding stage, so that the model can be calibrated with embeddings?
@cjluo-omniml any thoughts and code pointers?
I think it is possible. Have you tried it? In this case, the model here is the llm in your code, and the input calibration data is the projection output.
The only challenge might be exporting this as a TRT-LLM checkpoint for deployment, but you should at least be able to use AMMO to finish the PTQ.
@cjluo-omniml it works for me. Basically, if I still initialize the model normally and call it with model(inputs_embeds=<embedding after vocab embedding layer>), everything works out of the box and I am getting reasonable accuracy.
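For anyone else following this thread, here is a rough sketch of such a calibration loop. It assumes the atq.quantize(model, cfg, forward_loop=...) entry point used by quantize_by_ammo.py; llm, encoder, calib_loader, vocab_size, and the batch keys are placeholders for your own components, and the config constants may differ across AMMO/ModelOpt versions:

import torch
import ammo.torch.quantization as atq  # newer releases: modelopt.torch.quantization

def build_inputs_embeds(llm, encoder, input_ids, pixel_values, vocab_size):
    # Reproduce the embedding-injection step so the LLM is calibrated on the
    # same kind of inputs it will see at inference time.
    multimodal_mask = input_ids >= vocab_size             # placeholder token positions
    text_ids = input_ids.masked_fill(multimodal_mask, 0)  # any valid id; overwritten below
    inputs_embeds = llm.get_input_embeddings()(text_ids)
    # Assumes the number of masked positions equals the number of encoder embeddings.
    multimodal_embeds = encoder(pixel_values).reshape(-1, inputs_embeds.shape[-1])
    inputs_embeds[multimodal_mask] = multimodal_embeds.to(inputs_embeds.dtype)
    return inputs_embeds

@torch.no_grad()
def calibrate_loop():
    for batch in calib_loader:  # placeholder calibration dataloader
        inputs_embeds = build_inputs_embeds(
            llm, encoder, batch["input_ids"], batch["pixel_values"], vocab_size)
        llm(inputs_embeds=inputs_embeds)

# Same quantize entry point as quantize_by_ammo.py; pick the config you need,
# e.g. INT4 AWQ or INT8 SmoothQuant.
atq.quantize(llm, atq.INT4_AWQ_CFG, forward_loop=calibrate_loop)

The checkpoint export is the part flagged above as the potential challenge; the visual engine itself is built separately, as noted earlier in the thread.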
Hi!
I'd like to understand which quantization methods the current multimodal (decoder-only) pipeline supports.
From https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal#llava it seems that only weight-only quantization is supported. Is SmoothQuant also supported? https://github.com/NVIDIA/TensorRT-LLM/blob/0f041b7b57d118037f943fd6107db846ed147b91/examples/llama/convert_checkpoint.py#L1400 seems to indicate that LLaVA could also do SmoothQuant.
If that's the case, is the AWQ method through AMMO also supported for LLaVA?