NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How is multi-LoRA supported? #1224

Open sleepwalker2017 opened 6 months ago

sleepwalker2017 commented 6 months ago

Is that supported? I mean the S-LoRA mechanism.

If yes, is there an example of it? If not, are there any future plans?

byshiue commented 6 months ago

Here is the documentation: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints
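
For reference, the flow in that README amounts to building the engine with the LoRA plugin enabled and the LoRA checkpoints registered, then selecting an adapter per request with a task uid at run time. Below is a minimal sketch, wrapped in Python for convenience; the paths are placeholders and the flag names are copied from that example, so they may differ between TensorRT-LLM versions.

```python
# Sketch of the multi-LoRA example from examples/llama; flags may vary by version.
import subprocess

# 1. Build an engine with the LoRA plugin enabled and two LoRA checkpoints registered.
#    All paths below are placeholders (a converted base checkpoint and two HF LoRA dirs).
subprocess.run([
    "trtllm-build",
    "--checkpoint_dir", "./tllm_ckpt_llama_7b",
    "--output_dir", "./engines/llama_7b_multi_lora",
    "--gemm_plugin", "float16",
    "--lora_plugin", "float16",                      # enables LoRA support in the engine
    "--lora_dir", "./lora_ckpt_A", "./lora_ckpt_B",  # two LoRA checkpoints
    "--max_lora_rank", "8",
    "--lora_target_modules", "attn_q", "attn_k", "attn_v",
], check=True)

# 2. Run with one prompt per adapter: task uid -1 = base model only, 0 and 1 = the two LoRAs.
subprocess.run([
    "python", "examples/run.py",
    "--engine_dir", "./engines/llama_7b_multi_lora",
    "--tokenizer_dir", "./llama-7b-hf",
    "--max_output_len", "32",
    "--input_text", "prompt for base", "prompt for LoRA A", "prompt for LoRA B",
    "--lora_task_uids", "-1", "0", "1",
], check=True)
```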

sleepwalker2017 commented 6 months ago

Here is the documentation: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints

Hi, I have seen this:

These two LoRA checkpoints add LoRA modules to q_proj and v_proj. Because we only support adding lora modules on q, k and v at the same time

What does this mean? Does TRT-LLM not support LoRA on MLP modules?

As far as I know, there are LoRAs that add modules to up_proj, down_proj, and gate_proj.

Are these fully supported?

And how is it supported? Does TRT-LLM use the SGMV kernel to batch the multiple LoRAs?

byshiue commented 6 months ago

The warning message "Because we only support adding lora modules on q, k and v at the same time" means that if you want to add a LoRA on q, k, or v, you need to enable all three of them when building the engine. It is not related to the MLP; LoRA for MLP modules is also supported.
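
In other words, the constraint applies to the --lora_target_modules list passed at build time, not to which layers LoRA can reach. Here is a small sketch of how that list might look; the module names are taken as assumptions from the LLaMA example and may vary by version.

```python
# Build-time LoRA target selection (sketch; module names are assumed, check your version's docs).

# Attention LoRA: even if a given checkpoint only adapts q_proj and v_proj,
# q, k and v must all be enabled together when the engine is built.
attention_targets = ["attn_q", "attn_k", "attn_v"]

# MLP LoRA is supported as well; these targets correspond to the gate/up/down projections.
mlp_targets = ["mlp_gate", "mlp_h_to_4h", "mlp_4h_to_h"]

build_flags = ["--lora_target_modules", *attention_targets, *mlp_targets]
print(" ".join(build_flags))
```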

sleepwalker2017 commented 6 months ago

The warning message "Because we only support adding lora modules on q, k and v at the same time" means that if you want to add a LoRA on q, k, or v, you need to enable all three of them when building the engine. It is not related to the MLP; LoRA for MLP modules is also supported.

Hello, I have read the documentation and run the example.

It's a basic demo. I still have two questions:

  1. Is it implemented using the SGMV kernel? I mean, batching requests with multiple LoRAs to run inference efficiently.
  2. Is this feature ready for serving? Scheduling requests with LoRA using continuous batching, paged KV cache, and so on: are these supported for multi-LoRA inference? I checked the code but couldn't find any LoRA-related code in the benchmark file gptManagerBenchmark.cpp.

Hope for your reply, thank you!

sleepwalker2017 commented 6 months ago

In case I didn't express myself clearly, I want to describe the scenario for multi-LoRA serving:

When we start the server, we load the base model into GPU memory and multiple LoRA weights into CPU RAM. When we send requests, we provide the prompt and a lora_id, and the server is able to batch multiple requests together to run inference.

Is this supported in the latest TRT-LLM now?
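
To make the described setup concrete, here is an illustrative sketch of that serving pattern in plain Python. This is not TensorRT-LLM's actual API; the class, the stand-in generate call, and the cache are invented purely to mirror the scenario above (base model on the GPU, adapters in CPU RAM, each request carrying a prompt plus a lora_id).

```python
# Illustration only: the multi-LoRA serving pattern described above, not TensorRT-LLM's API.
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Request:
    prompt: str
    lora_id: Optional[int]  # None means "run on the base model without an adapter"


class MultiLoraServer:
    """Base model resident on the GPU, adapters kept in CPU RAM until first use."""

    def __init__(self, base_model: Any, cpu_lora_store: dict):
        self.base_model = base_model          # stand-in for the engine on the GPU
        self.cpu_lora_store = cpu_lora_store  # {lora_id: adapter weights} in host memory
        self.gpu_lora_cache = {}              # adapters already copied to device memory

    def _adapter(self, lora_id: int) -> Any:
        # Copy the adapter to the device the first time it is referenced,
        # then reuse the cached copy for every later request with the same id.
        if lora_id not in self.gpu_lora_cache:
            self.gpu_lora_cache[lora_id] = ("on_gpu", self.cpu_lora_store[lora_id])
        return self.gpu_lora_cache[lora_id]

    def run_batch(self, batch: list) -> list:
        # One batched step; each request resolves its own adapter (or none).
        results = []
        for req in batch:
            adapter = self._adapter(req.lora_id) if req.lora_id is not None else None
            results.append(f"generate({req.prompt!r}, adapter={adapter})")
        return results


if __name__ == "__main__":
    server = MultiLoraServer("llama-base", {0: "lora_A", 1: "lora_B"})
    out = server.run_batch([Request("hi", 0), Request("hello", 1), Request("hey", None)])
    print("\n".join(out))
```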

byshiue commented 6 months ago

  1. TRT-LLM batches several requests and their LoRAs into a single GEMM kernel (a rough sketch of the idea follows below).
  2. In the C++ runtime, users need to pass the LoRA weight pointers and a task ID the first time, and TensorRT-LLM will record them in its cache. After that, users only need to pass the task ID. Further features, including CPU offloading, are under development.
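
As a toy illustration of point 1, the computation being batched looks like the snippet below, with a shared base weight and per-task low-rank factors. The actual TRT-LLM kernel fuses the per-request updates into a single GEMM launch on the GPU rather than looping in Python; whether it is exactly the SGMV kernel asked about is not confirmed here.

```python
# Toy numpy illustration of batching requests that use different LoRA adapters.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 4

W = rng.standard_normal((d_in, d_out))    # shared base weight
loras = {                                 # task_id -> (A, B) low-rank factors
    0: (rng.standard_normal((d_in, rank)), rng.standard_normal((rank, d_out))),
    1: (rng.standard_normal((d_in, rank)), rng.standard_normal((rank, d_out))),
}

x = rng.standard_normal((3, d_in))        # hidden states for a batch of 3 requests
task_ids = [0, 1, 0]                      # each request selects its own adapter

# Base projection for the whole batch in one GEMM.
base_out = x @ W

# Per-request low-rank update; a fused kernel performs all of these in one launch.
lora_out = np.stack([x[i] @ loras[t][0] @ loras[t][1] for i, t in enumerate(task_ids)])

y = base_out + lora_out
print(y.shape)  # (3, d_out): every request gets the shared base plus its own adapter
```
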
sleepwalker2017 commented 6 months ago

  1. TRT-LLM batches several requests and their LoRAs into a single GEMM kernel.
  2. In the C++ runtime, users need to pass the LoRA weight pointers and a task ID the first time, and TensorRT-LLM will record them in its cache. After that, users only need to pass the task ID. Further features, including CPU offloading, are under development.

The batching of LoRA requests hasn't been integrated with continuous batching and paged attention yet; do I understand correctly? Thank you.

byshiue commented 5 months ago

It is independent of continuous batching and paged attention. From the LoRA modules' point of view, continuous batching and static batching are the same.