NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to load multiple LoRA weights and multiple text inputs for inference? #599

Open jkl375 opened 9 months ago

jkl375 commented 9 months ago

How can I load multiple LoRA weights and multiple text inputs for inference? Currently, only a single LoRA weight and a single set of input tokens are supported as inputs. How can multiple LoRA weights and input tokens be supplied for batch inference? See https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/lora_manager.py#L234. When I run inference with two text inputs and the same LoRA weights for both, as follows:

mpirun -n 2 --allow-run-as-root python ../run.py --engine_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
              --max_output_len 50 \
              --temperature 1 \
              --tokenizer_dir "/workspace/qllama-7b-chat" \
              --input_text "你好" "你是谁?" \
              --lora_dir "/workspace/offline" \
              --lora_task_uids 0 0 \
              --no_add_special_tokens

the following error occurred:

File "/workspace/examples/llama/../run.py", line 285, in main
    outputs = runner.generate(batch_input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 364, in generate
    batch_input_ids, input_lengths = self._prepare_inputs(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 279, in _prepare_inputs
    raise RuntimeError(
RuntimeError: Input batch size (2) exceeds the engine limit (1)
Traceback (most recent call last):
  File "/workspace/examples/llama/../run.py", line 339, in <module>
    main(args)
  File "/workspace/examples/llama/../run.py", line 285, in main
    outputs = runner.generate(batch_input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 364, in generate
    batch_input_ids, input_lengths = self._prepare_inputs(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 279, in _prepare_inputs
    raise RuntimeError(
RuntimeError: Input batch size (2) exceeds the engine limit (1)

Later, I set self.max_batch_size = 2, and this error occurred:

[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
[12/07/2023-01:33:56] [TRT] [E] 3: [executionContext.cpp::resolveSlots::2991] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::resolveSlots::2991, condition: allInputDimensionsSpecified(routine) )
Traceback (most recent call last):
  File "/workspace/examples/llama/../run.py", line 339, in <module>
    main(args)
  File "/workspace/examples/llama/../run.py", line 285, in main
    outputs = runner.generate(batch_input_ids,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner.py", line 387, in generate
    outputs = self.session.decode(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 639, in wrapper
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2208, in decode
    return self.decode_regular(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1947, in decode_regular
    should_stop, next_step_buffer, tasks, context_lengths, host_context_lengths, attention_mask, logits, encoder_input_lengths = self.handle_per_step(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1712, in handle_per_step
    raise RuntimeError('Executing TRT engine failed!')
RuntimeError: Executing TRT engine failed!

Should I build different engines for different batch sizes?

byshiue commented 9 months ago

Could you share the script used to build the engine?

jkl375 commented 9 months ago
python build.py --model_dir /workspace/qllama-7b-chat \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
                --max_batch_size 1 \
                --max_input_len 512 \
                --max_output_len 50 \
                --use_lora_plugin float16 \
                --visualize  \
                --hf_lora_dir "/workspace/linker_g0TZGi36_best" \
                --world_size 2 --tp_size 2

/workspace/linker_g0TZGi36_best is the LoRA weight produced by fine-tuning.

byshiue commented 9 months ago

If you want to run batch size > 1, you should set max_batch_size during engine building.
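
As a concrete illustration, here is a sketch that reuses the build command shared above with only --max_batch_size changed; rebuilding like this should let the earlier two-prompt run command pass the batch-size check:

python build.py --model_dir /workspace/qllama-7b-chat \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
                --max_batch_size 2 \
                --max_input_len 512 \
                --max_output_len 50 \
                --use_lora_plugin float16 \
                --hf_lora_dir "/workspace/linker_g0TZGi36_best" \
                --world_size 2 --tp_size 2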

jkl375 commented 9 months ago

Thank you, but when is loading multiple LoRA weights expected to be supported?

codybum commented 9 months ago

Using multiple LoRA weights, both independently and merged, would be a very important feature.

byshiue commented 9 months ago

The core feature is supported, but we don't have a checkpoint to demonstrate it with. You could modify the lora_manager to load multiple LoRA weights.
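
As a rough sketch of that modification (this is not the actual LoraManager API; the helper name, uid scheme, and the second adapter directory "/workspace/another_lora" are illustrative assumptions), the idea is to load each HF LoRA checkpoint separately and key it by a task uid so that --lora_task_uids can select a different adapter per input:

# Illustrative bookkeeping only; the real lora_manager.py has its own weight layout.
import json
from pathlib import Path

import torch

def load_hf_loras(lora_dirs):
    """Map task uid (index into lora_dirs) -> (adapter config, adapter state dict)."""
    loras = {}
    for uid, lora_dir in enumerate(lora_dirs):
        lora_dir = Path(lora_dir)
        config = json.loads((lora_dir / "adapter_config.json").read_text())
        # Standard PEFT checkpoint file name; some adapters ship safetensors instead.
        weights = torch.load(lora_dir / "adapter_model.bin", map_location="cpu")
        loras[uid] = (config, weights)
    return loras

# uid 0 and uid 1 would then refer to two independently fine-tuned adapters.
loras = load_hf_loras(["/workspace/offline", "/workspace/another_lora"])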

WangxuP commented 8 months ago

> The core feature is supported, but we don't have a checkpoint to demonstrate it with. You could modify the lora_manager to load multiple LoRA weights.

I read the lora_manager.py source code. I found that the LoraConfig class can only load one LoRA model and does not show the ability to load multiple LoRA models.

byshiue commented 8 months ago

> The core feature is supported, but we don't have a checkpoint to demonstrate it with. You could modify the lora_manager to load multiple LoRA weights.
>
> I read the lora_manager.py source code. I found that the LoraConfig class can only load one LoRA model and does not show the ability to load multiple LoRA models.

As I mentioned, users need to modify lora_manager.py to load several LoRA models.

WangxuP commented 8 months ago

> The core feature is supported, but we don't have a checkpoint to demonstrate it with. You could modify the lora_manager to load multiple LoRA weights.
>
> I read the lora_manager.py source code. I found that the LoraConfig class can only load one LoRA model and does not show the ability to load multiple LoRA models.
>
> As I mentioned, users need to modify lora_manager.py to load several LoRA models.

Are there any examples I could refer to? Thank you very much!

byshiue commented 8 months ago

No. We haven't found a suitable model to prepare the example with. If you could share any checkpoints with several LoRA weights, we would be happy to prepare such an example.

hchoi-moveworks commented 7 months ago

> The core feature is supported, but we don't have a checkpoint to demonstrate it with. You could modify the lora_manager to load multiple LoRA weights.
>
> I read the lora_manager.py source code. I found that the LoraConfig class can only load one LoRA model and does not show the ability to load multiple LoRA models.
>
> As I mentioned, users need to modify lora_manager.py to load several LoRA models.

@byshiue If we modify lora_manager to load multiple LoRA adapters, do they share the same base model? Or would the base model weights also be repeated?

byshiue commented 7 months ago

They share the same base model. We have an example here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints.
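
Adapted to the paths used earlier in this thread, a run along the lines of that README example might look like the following (the second adapter directory "/workspace/another_lora" and its uid are hypothetical, and the engine must have been built with a large enough --max_batch_size):

mpirun -n 2 --allow-run-as-root python ../run.py \
              --engine_dir "/tmp/new_lora_7b/trt_engines/fp16/2-gpu/" \
              --max_output_len 50 \
              --tokenizer_dir "/workspace/qllama-7b-chat" \
              --input_text "你好" "你是谁?" \
              --lora_dir "/workspace/offline" "/workspace/another_lora" \
              --lora_task_uids 0 1 \
              --no_add_special_tokens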

WangxuP commented 7 months ago

> They share the same base model. We have an example here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints.

ok, thanks a lot!

hchoi-moveworks commented 7 months ago

> They share the same base model. We have an example here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints.

Thanks @byshiue !!

In example 1, the build script only specifies one hf_lora_dir. Should this be two LoRA dirs, to be consistent with the run script and the README description?

python build.py --model_dir ${BASE_LLAMA_MODEL} \
              ....
                --hf_lora_dir "Japanese-Alpaca-LoRA-7b-v0/" \

byshiue commented 6 months ago

> They share the same base model. We have an example here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-several-lora-checkpoints.
>
> Thanks @byshiue !!
>
> In example 1, the build script only specifies one hf_lora_dir. Should this be two LoRA dirs, to be consistent with the run script and the README description?
>
> python build.py --model_dir ${BASE_LLAMA_MODEL} \
>               ....
>                 --hf_lora_dir "Japanese-Alpaca-LoRA-7b-v0/" \

In engine building, we only use one of the LoRA models to get the common parameters, and users need to set max_lora_rank properly based on their LoRA models. Most LoRA logic is handled at runtime, which is why we only load multiple LoRA models at runtime.
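
As a small sketch of that setup step (the adapter directories are placeholders, and it is assumed the build script exposes the parameter as --max_lora_rank), one could read the rank "r" from each adapter's config and pass the largest value at build time:

# Sketch: choose max_lora_rank as the largest rank ("r") among the adapters
# that will be served at runtime, then pass that value when building the engine.
import json
from pathlib import Path

lora_dirs = ["Japanese-Alpaca-LoRA-7b-v0/", "/workspace/another_lora"]  # placeholder dirs
max_lora_rank = max(
    json.loads((Path(d) / "adapter_config.json").read_text())["r"] for d in lora_dirs
)
print(f"build.py ... --max_lora_rank {max_lora_rank}")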

Baboom-l commented 3 months ago

@byshiue How is dynamic LoRA switching implemented in TensorRT? Shouldn't it be static after being converted to an engine?

byshiue commented 3 months ago

The LoRA weights are managed by the runtime instead of the engine. We pass the pointers to the LoRA weights as inputs to the TRT engine, so we can pass different pointers to switch LoRA weights dynamically.
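
A purely conceptual sketch of that design (illustrative names and shapes, not the actual runtime code): the engine only ever sees pointer values for the LoRA weights, so switching adapters per request amounts to choosing which device buffers those pointers reference.

# Conceptual illustration: LoRA weights live in runtime-managed device buffers,
# and the engine receives their addresses as ordinary integer inputs each step.
import torch

# One (placeholder-shaped) weight buffer per LoRA task uid, kept on the GPU.
lora_pool = {
    0: torch.randn(64, 4096, device="cuda", dtype=torch.float16),
    1: torch.randn(64, 4096, device="cuda", dtype=torch.float16),
}

def lora_pointers_for_batch(task_uids):
    # One pointer per sequence in the batch; the LoRA plugin inside the engine
    # dereferences these, so no engine rebuild is needed to switch adapters.
    return torch.tensor([lora_pool[uid].data_ptr() for uid in task_uids],
                        dtype=torch.int64)

ptrs = lora_pointers_for_batch([0, 1])  # two sequences using two different adapters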