JoJoLev opened this issue 3 weeks ago
Hi @JoJoLev ,
I suppose the output folder is expected. You built the engine with TP=1, so there is one rank0.engine. The LoRA weights are saved in adapter_model.safetensors under each LoRA folder.
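Roughly, the output folder should look like this (a sketch based on your build command; the engine directory name comes from your --output_dir, and anything beyond rank0.engine, config.json, lora/0/, lora/1/, and adapter_model.safetensors may differ by version):

```
/opt/tensorrt_llm_engine/            # --output_dir from trtllm-build
├── config.json
├── rank0.engine                     # single engine file because TP=1
└── lora/
    ├── 0/                           # first --lora_dir entry
    │   └── adapter_model.safetensors
    └── 1/                           # second --lora_dir entry
        └── adapter_model.safetensors
```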
Hi @syuoni
Thanks for the response. Yes, I have a rank0.engine file and a config. My question now is: when I deploy onto a container, say NVIDIA Triton, do I have to include the LoRA weights? Or have those been baked into the rank0.engine?
Yes, you have to include the LoRA weights. They are not baked into the engine because TRT-LLM supports multi-LoRA, so it has to load LoRA weights dynamically at runtime.
@syuoni got it!
Thank you! After running my engine build I had the aforementioned folder structure. If I were to deploy on NVIDIA Triton, would I include the LoRA weights in the 1/ subfolder where my rank0.engine file and config.json are, or would they be placed on a different path? I believe this is the container we are going with for deployment.
Hi @JoJoLev ,
The LoRA weights under the engine folder (i.e., lora/0/ and lora/1/) are used by ModelRunner(Cpp), for example via the run.py script.
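For example, a run.py invocation looks roughly like this (a sketch: the tokenizer path and prompt are placeholders, and flag names may differ across TensorRT-LLM versions):

```bash
# Sketch: run the engine built above and apply one of the LoRAs baked-in paths.
# --lora_task_uids indexes the --lora_dir list passed at build time
# (0 = first adapter, 1 = second, -1 = no LoRA) -- assumed indexing.
python3 examples/run.py \
    --engine_dir /opt/tensorrt_llm_engine \
    --tokenizer_dir /opt/base_model \
    --input_text "Hello, my name is" \
    --max_output_len 50 \
    --lora_task_uids 0
```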
To use LoRA with the Triton server, we need to convert the LoRA weights to the format required by the C++ runtime. Please follow the steps shown in this doc. Also, I'd like to remind you that the Triton server deployment needs to run in the container provided by the TensorRT-LLM Backend, see here.
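The conversion step from that doc looks roughly like this (a sketch: the output paths are placeholders, and exact options may differ by version):

```bash
# Sketch: convert each HF PEFT adapter into the tensor format the C++ runtime expects.
python3 tensorrt_llm/examples/hf_lora_convert.py \
    -i /opt/lora_1 \
    -o /opt/lora_1_converted \
    --storage-type float16

python3 tensorrt_llm/examples/hf_lora_convert.py \
    -i /opt/lora_2 \
    -o /opt/lora_2_converted \
    --storage-type float16
```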
@syuoni , got it. I just converted the LoRA files to that format with TensorRT-LLM. I now have the engine built with the LoRAs and have converted the LoRA weights.
When constructing the Triton model repository, where do the files go?
Typically with Triton there is a structure of:
- 1/
  - rank0.engine
  - config.json
- config.pbtxt

Where would the LoRA files go?
Please see this doc: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/lora.md#launch-tritonserver
Once LoRA checkpoints are converted by hf_lora_convert.py, they can be passed to inflight_batcher_llm_client.py via --lora-path.
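For example, roughly (a sketch: the tokenizer path, prompt, and converted LoRA path are placeholders, and flag names may differ across backend versions):

```bash
# Sketch: send one request that applies the converted LoRA under task id 1.
# Per the linked doc, later requests can reuse the same --lora-task-id
# without resending --lora-path, since the server caches the weights.
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --request-output-len 50 \
    --text "Hello, my name is" \
    --tokenizer-dir /opt/base_model \
    --lora-path /opt/lora_1_converted \
    --lora-task-id 1
```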
Hi @syuoni
Thanks for the support! As an update, this works; I got inference running per the tutorial you shared.
Now, the thing I need to configure is how to properly construct this setup to deploy on SageMaker. For the config.pbtxt in NVIDIA Triton, would I call the LoRA paths there? Or would they be in the file structure for Triton when I deploy the endpoint?
Thanks!
Hi @JoJoLev ,
I don't think I've fully understood your questions. I would say that lora paths are not closely related to config.pbtxt.
Once LoRA checkpoints are processed by hf_lora_convert.py, the processed LoRA paths can be used with Triton. They should be specified via --lora-path to inflight_batcher_llm_client.py. To clarify, the LoRA paths live on the client side.
Is this expected? I figured there would be rank engines. I passed these LoRA directories in the engine build:
`trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_tp1 --output_dir /opt/tensorrt_llm_engine --gemm_plugin auto --lora_plugin auto --max_batch_size 8 --max_input_len 512 --max_seq_len 562 --lora_dir "/opt/lora_1" "/opt/lora_2" --max_lora_rank 8 --lora_target_modules attn_q attn_k attn_v`
Any advice is appreciated.