pankajroark opened 2 months ago
@pankajroark The engine itself supports fp32 LoRA, so this runtime limitation is unnecessary. I can help add support for fp32 LoRA. Can you provide your model + LoRA checkpoint + commands (use a similar open-source alternative if your model is private) so I can validate?
Thanks, great to know that the engine supports fp32 LoRA. The model is indeed private; let me provide details shortly using an open-source alternative.
Would appreciate any updates on this issue. thx
@pankajroark I cannot access the fp32 LoRA link you provided, it may be a private repo. After some investigation, I find that the lora plugin only supports fp32 base model + fp32 lora now, so simply removing the runtime limitation is not enough to run fp16 base model + fp32 lora. We have to update lora plugin to support it, which makes it a feature request instead of a quick bugfix. We will try to allocate engineer bandwidth for it, but cannot promise to finish in v0.12.
Thanks for the update.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Currently, TensorRT-LLM requires that LoRA weights dtype match the base model dtype. The check is here: https://github.com/NVIDIA/TensorRT-LLM/blob/9dbc5b38baba399c5517685ecc5b66f57a177a4c/cpp/tensorrt_llm/runtime/loraUtils.cpp#L66
One way around this is to quantize the LoRA weights before passing them to TensorRT-LLM, but that results in unacceptably lower quality. The LoRA low-rank matrices get multiplied together, so quantizing them from fp32 to fp16 beforehand compounds the quality loss, whereas quantizing after the multiplication is much more accurate. We experimented with merging the LoRA weights into the base model and saw no quality degradation there, because the LoRA merge happens in fp32.
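A quick NumPy sketch of why the quantization order matters. The matrices here are synthetic stand-ins (the sizes `d`, `r` and the Gaussian weights are illustrative assumptions, not taken from any real checkpoint): rounding both low-rank factors to fp16 before the matmul lets both rounding errors propagate through the accumulation, while rounding the fp32 product once keeps the error smaller.

```python
import numpy as np

# Hypothetical illustration: compare quantizing the LoRA factors A and B
# to fp16 *before* multiplying vs. multiplying in fp32 and quantizing the
# product *afterwards*. d and r are made-up, LoRA-like dimensions.
rng = np.random.default_rng(0)
d, r = 1024, 16                      # hidden size, LoRA rank (illustrative)
A = rng.standard_normal((d, r)).astype(np.float32)
B = rng.standard_normal((r, d)).astype(np.float32)

ref = A @ B                          # fp32 reference product

# Quantize-then-multiply: the rounding errors of both A and B
# propagate through every term of the matmul accumulation.
early = A.astype(np.float16).astype(np.float32) @ \
        B.astype(np.float16).astype(np.float32)

# Multiply-then-quantize: a single rounding step on the final product.
late = (A @ B).astype(np.float16).astype(np.float32)

err_early = np.linalg.norm(early - ref)  # Frobenius norm of the error
err_late = np.linalg.norm(late - ref)
print(f"quantize-before-multiply error: {err_early:.4f}")
print(f"quantize-after-multiply  error: {err_late:.4f}")
# err_early typically comes out noticeably larger than err_late.
```

This is only a toy model of the effect; with real LoRA weight distributions the gap can be larger, which matches the quality degradation observed when quantizing the LoRA checkpoint up front.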
It would be best if TensorRT-LLM could accept fp32 LoRA weights, multiply the LoRA low-rank matrices in fp32, and quantize the resulting product to fp16 to conform to the base model dtype. This way the quantization loss would be much lower.