NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to integrate Multi-LoRA Setup at Inference with NVIDIA Triton / TensorRT-LLM? I built the engine... #2371

Open JoJoLev opened 3 weeks ago

JoJoLev commented 3 weeks ago
I built the engine with two separate LoRA adapters on top of the base Llama 3.1 model. The output from the build is rank0.engine, config.json, and then a lora folder with the following structure:

lora/
├── 0/
│   ├── adapter_config.json
│   └── adapter_model.safetensors
└── 1/
    ├── adapter_config.json
    └── adapter_model.safetensors

Is this expected? I figured there would be rank engines. These are the LoRA directories I passed on the engine build:

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_tp1 --output_dir /opt/tensorrt_llm_engine --gemm_plugin auto --lora_plugin auto --max_batch_size 8 --max_input_len 512 --max_seq_len 562 --lora_dir "/opt/lora_1" "/opt/lora_2" --max_lora_rank 8 --lora_target_modules attn_q attn_k attn_v

Any advice is appreciated.

syuoni commented 2 weeks ago

Hi @JoJoLev ,

I suppose the output folder is expected. You built the engine with TP=1, so there is a single rank0.engine. The LoRA weights are saved in adapter_model.safetensors under each LoRA folder.

JoJoLev commented 2 weeks ago

Hi @syuoni

Thanks for the response. Yes, I have a rank0.engine file and a config.json. My question now is: when I deploy onto a container, say NVIDIA Triton, do I have to include the LoRA weights? Or have those been baked into rank0.engine?

syuoni commented 2 weeks ago

Yes, you have to include the LoRA weights. They are not baked into the engine because TRT-LLM supports multi-LoRA, so it has to load the LoRA weights dynamically at runtime.

JoJoLev commented 2 weeks ago

@syuoni got it!

Thank you! After running my engine build I have the aforementioned folder structure. If I were to deploy on NVIDIA Triton, would I include the LoRA weights in the 1/ subfolder where my rank0.engine file and config.json are? Or would they be placed on a different path? I believe this is the container we are going with for deployment.

syuoni commented 2 weeks ago

Hi @JoJoLev ,

The LoRA weights under the engine folder (i.e., lora/0/ and lora/1/) are used by ModelRunner(Cpp), for example in the run.py script.
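For illustration, a minimal run.py invocation could look like the sketch below (engine and tokenizer paths are placeholders for this setup; `--lora_task_uids` selects which of the engine's LoRA folders, 0 or 1 here, is applied to the input, and -1 disables LoRA):

```bash
# Sketch only: paths are hypothetical, adjust to your checkout and model locations.
python3 examples/run.py \
    --engine_dir /opt/tensorrt_llm_engine \
    --tokenizer_dir /path/to/llama-3.1-base \
    --max_output_len 64 \
    --lora_task_uids 0 \
    --input_text "Hello, how are you?"
```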

To use LoRA with the Triton server, we need to convert the LoRA weights to the format required by the C++ runtime. Please follow the steps shown in this doc. Also, I'd like to point out that the Triton server deployment needs to run in the container provided by the TensorRT-LLM Backend, see here.
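As a rough sketch of that conversion step (the adapter paths come from the build command above, the output directories are hypothetical, and the flags follow the hf_lora_convert.py usage shown in the linked doc):

```bash
# Convert each HF LoRA adapter to the format expected by the C++ runtime.
python3 tensorrt_llm/examples/hf_lora_convert.py \
    -i /opt/lora_1 -o /opt/lora_1_converted --storage-type float16
python3 tensorrt_llm/examples/hf_lora_convert.py \
    -i /opt/lora_2 -o /opt/lora_2_converted --storage-type float16
```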

JoJoLev commented 1 week ago

@syuoni, got it. I just converted the LoRA files to that format with TensorRT-LLM. I now have the engine built with the LoRAs and have converted the LoRA weights.

When constructing the Triton model repository, where do the files go?

Typically with Triton there is a structure like:

Path/
├── 1/
│   ├── rank0.engine
│   └── config.json
└── config.pbtxt

Where would the LoRA files go?

syuoni commented 1 week ago

Please see this doc: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/lora.md#launch-tritonserver

Once the LoRA checkpoints are converted by hf_lora_convert.py, they can be passed to inflight_batcher_llm_client.py via --lora-path.
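Putting that together, a client call could look roughly like this (paths and the output length are placeholders; flag names follow the linked lora.md):

```bash
# Sketch: attach the converted LoRA weights to a request under a chosen task id.
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --request-output-len 64 \
    --text "Hello, how are you?" \
    --tokenizer-dir /path/to/llama-3.1-base \
    --lora-path /opt/lora_1_converted \
    --lora-task-id 1
```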

JoJoLev commented 1 week ago

Hi @syuoni

Thanks for the support! As an update, this works; I got inference running per the tutorial you shared.

Now, the thing I need to figure out is how to properly structure this setup for deployment on SageMaker. For the config.pbtxt in NVIDIA Triton, would I reference the LoRA paths there? Or would they live in the Triton file structure when I deploy the endpoint?

Thanks!

syuoni commented 1 week ago

Hi @JoJoLev ,

I don't think I've fully understood your question. I would say that the LoRA paths are not closely related to config.pbtxt.

Once the LoRA checkpoints are processed by hf_lora_convert.py, the processed LoRA paths can be used with Triton. They should be specified via --lora-path to inflight_batcher_llm_client.py. To clarify, the LoRA paths live on the client side.
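One related detail, as I understand the same lora.md: the server caches LoRA weights keyed by the task id, so only the first request for a given adapter needs `--lora-path`; while the adapter is still cached, later requests can pass just the task id, e.g.:

```bash
# Sketch: reuse a previously sent adapter by task id only (assumes it is still in the LoRA cache).
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --request-output-len 64 \
    --text "Hello again" \
    --tokenizer-dir /path/to/llama-3.1-base \
    --lora-task-id 1
```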