sleepwalker2017 opened 6 months ago
Here is the document https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints.
Hi, I see this there:
"These two LoRA checkpoints add LoRA modules to q_proj and v_proj. Because we only support adding lora modules on q, k and v at the same time ..."
What does this mean? Does TRT-LLM not support LoRA on MLP modules? As far as I know, there are LoRAs added to up_proj, down_proj, and gate_proj. Are these fully supported?
And how are they supported? Does TRT-LLM use an SGMV kernel to batch the multiple LoRAs?
The warning message "Because we only support adding lora modules on q, k and v at the same time" means that if you want to add a LoRA on q, k, or v, you need to enable all three of them when building the engine. It is not related to the MLP; LoRA on MLP modules is also supported.
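In other words: if any of the attention projections gets a LoRA, all three must be enabled at engine build time, while MLP modules are independent of that rule. A minimal Python sketch of the constraint (the validate_lora_target_modules helper is hypothetical, not TensorRT-LLM API; the module names follow TRT-LLM's attn_q/attn_k/attn_v/mlp_h_to_4h naming):

```python
# Hypothetical helper illustrating the build-time rule above;
# it is NOT part of TensorRT-LLM.
ATTN_QKV = {"attn_q", "attn_k", "attn_v"}

def validate_lora_target_modules(modules: list[str]) -> list[str]:
    """Reject module lists that enable only a subset of q/k/v."""
    requested = set(modules)
    if requested & ATTN_QKV and not ATTN_QKV <= requested:
        missing = sorted(ATTN_QKV - requested)
        raise ValueError(f"q/k/v LoRA must be enabled together; missing: {missing}")
    return modules

# MLP LoRA modules are allowed on their own:
validate_lora_target_modules(["mlp_h_to_4h", "mlp_4h_to_h", "mlp_gate"])
# q/k/v must come as a set:
validate_lora_target_modules(["attn_q", "attn_k", "attn_v"])
try:
    validate_lora_target_modules(["attn_q"])  # only q -> rejected
except ValueError as err:
    print(err)
```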
Hello, I read the documentation and ran the example. It's a basic demo, and I still have two questions, one of them about gptManagerBenchmark.cpp.
Hope for your reply, thank you!
In case I didn't express myself clearly, I want to describe the scenario for multi-LoRA serving: when we start the server, we load the base model into GPU memory and multiple LoRA weights into CPU RAM; when we send a request, we attach the prompt and a lora_id, and the server batches multiple requests together to run inference, roughly as in the sketch below.
Is this supported in the latest TRT-LLM now?
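A hypothetical sketch of the flow I mean (all names here are made up to describe the pattern, not TensorRT-LLM API):

```python
from dataclasses import dataclass, field

# Made-up sketch of the serving pattern: base model on GPU, LoRA weights
# keyed by lora_id in host RAM, requests carrying (prompt, lora_id) that
# get batched into one forward pass. Not TensorRT-LLM code.

@dataclass
class MultiLoraServer:
    lora_store: dict[int, str] = field(default_factory=dict)    # lora_id -> weights in CPU RAM
    pending: list[tuple[str, int]] = field(default_factory=list)

    def register_lora(self, lora_id: int, weights_path: str) -> None:
        self.lora_store[lora_id] = weights_path

    def submit(self, prompt: str, lora_id: int) -> None:
        self.pending.append((prompt, lora_id))

    def step(self) -> None:
        # One batched forward over requests that may use *different* LoRAs.
        batch, self.pending = self.pending, []
        print(f"batched {len(batch)} requests, lora_ids = {[i for _, i in batch]}")

server = MultiLoraServer()
server.register_lora(0, "/ckpts/lora-A")
server.register_lora(1, "/ckpts/lora-B")
server.submit("Hello", lora_id=0)
server.submit("Bonjour", lora_id=1)
server.step()  # -> batched 2 requests, lora_ids = [0, 1]
```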
- TRT-LLM batches several requests and their LoRAs into a single GEMM kernel.
- In the C++ runtime, users need to pass the LoRA weight pointers and a task id first, and TensorRT-LLM will record them in the cache. After that, users only need to pass the task id. Further features, including CPU offloading, are under development. A conceptual sketch of both points follows after this list.
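To make both bullets concrete, here is a rough numpy sketch of the math only (standard LoRA, y = Wx + BAx), not of the fused grouped-GEMM kernel, with an illustrative cache standing in for the runtime's task-id cache; every name here is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                       # hidden size, LoRA rank
W = rng.standard_normal((d, d))    # shared base weight (GPU-resident in practice)

# Cache keyed by task id: the first request for a task supplies the LoRA
# weights; later requests only pass the task id (as described above).
lora_cache: dict[int, tuple[np.ndarray, np.ndarray]] = {}

def register_task(task_id: int, A: np.ndarray, B: np.ndarray) -> None:
    lora_cache.setdefault(task_id, (A, B))

def batched_lora_forward(X: np.ndarray, task_ids: list[int]) -> np.ndarray:
    """y_i = W x_i + B[t_i] (A[t_i] x_i) for request i with task id t_i."""
    Y = X @ W.T                       # one GEMM for the shared base weight
    for i, t in enumerate(task_ids):  # per-request low-rank update; the real
        A, B = lora_cache[t]          # kernel fuses these into one grouped GEMM
        Y[i] += (X[i] @ A.T) @ B.T
    return Y

for t in (0, 1):
    register_task(t, rng.standard_normal((r, d)), rng.standard_normal((d, r)))
X = rng.standard_normal((4, d))       # a batch of 4 requests, 2 distinct LoRAs
Y = batched_lora_forward(X, task_ids=[0, 1, 0, 1])
print(Y.shape)  # (4, 64)
```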
The batching of LoRA requests hasn't been integrated with continuous batching and paged attention; do I understand correctly? Thank you.
It is independent of continuous batching and paged attention. From the LoRA module's point of view, continuous batching and static batching are the same.
Is that supported? I mean the S-LoRA mechanism.
If yes, is there an example for it? If not, are there any future plans?