NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Unable to create Llama Engine with Lora Weights #938

Open rajeevbaalwan opened 7 months ago

rajeevbaalwan commented 7 months ago

System Info

Who can help?

@byshiue please help

Reproduction

I have a PEFT-trained Llama model that I want to run via TensorRT-LLM. I am using the build command below to create the engine.

python build.py --model_dir /mnt/hdd1/rajeevy/nlp/LLM/summarization/llama2-original \
    --dtype bfloat16 \
    --use_gpt_attention_plugin bfloat16 \
    --enable_context_fmha \
    --use_gemm_plugin bfloat16 \
    --use_lora_plugin bfloat16 \
    --output_dir /app/tensorrt_llm/examples/llama/7B/trt_engines/bfp16/2-gpu/ \
    --hf_lora_dir /mnt/hdd1/rajeevy/nlp/LLM/summarization/llama2-16jan/checkpoint-24690 \
    --remove_input_padding \
    --world_size 2 \
    --tp_size 2

Expected behavior

The model should build without any error.

Actual behavior

I am getting the error below:

Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/build.py", line 983, in <module>
    build(0, args)
  File "/app/tensorrt_llm/examples/llama/build.py", line 927, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/app/tensorrt_llm/examples/llama/build.py", line 836, in build_rank_engine
    inputs = tensorrt_llm_llama.prepare_inputs(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 513, in prepare_inputs
    model_inputs = self.prepare_basic_inputs(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/generation_mixin.py", line 521, in prepare_basic_inputs
    for lora_module in lora_target_modules:
TypeError: 'NoneType' object is not iterable
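
For context, the TypeError happens because build.py leaves lora_target_modules as None when no LoRA target modules are specified at build time, and prepare_basic_inputs then iterates over it without a guard. A minimal sketch of the failure mode, using the names from the traceback above:

lora_target_modules = None  # default when --lora_target_modules is not passed to build.py
for lora_module in lora_target_modules:  # raises TypeError: 'NoneType' object is not iterable
    pass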

additional notes

I have built the Docker container using the make -C docker release_build command.

Contents of the llama2-original folder:

-rw-rw-r-- 1 1002 1002  183 Jan 23 07:40 generation_config.json
-rw-rw-r-- 1 1002 1002 4.6G Jan 23 07:40 model-00001-of-00006.safetensors
-rw-rw-r-- 1 1002 1002 4.6G Jan 23 07:41 model-00002-of-00006.safetensors
-rw-rw-r-- 1 1002 1002 4.6G Jan 23 07:41 model-00003-of-00006.safetensors
-rw-rw-r-- 1 1002 1002 4.6G Jan 23 07:41 model-00004-of-00006.safetensors
-rw-rw-r-- 1 1002 1002 4.6G Jan 23 07:41 model-00005-of-00006.safetensors
-rw-rw-r-- 1 1002 1002 2.6G Jan 23 07:41 model-00006-of-00006.safetensors
-rw-rw-r-- 1 1002 1002  24K Jan 23 07:41 model.safetensors.index.json
byshiue commented 7 months ago

You need to set the lora_target_modules when building the engine; please check the documentation of the llama example.
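
For reference, here is the reporter's build command with the missing flag added. This is a sketch assuming the PEFT adapter was trained on the attention q/k/v projections; the values passed to --lora_target_modules must match the target_modules recorded in the adapter's adapter_config.json (for Llama, q_proj/k_proj/v_proj should map to attn_q/attn_k/attn_v).

python build.py --model_dir /mnt/hdd1/rajeevy/nlp/LLM/summarization/llama2-original \
    --dtype bfloat16 \
    --use_gpt_attention_plugin bfloat16 \
    --enable_context_fmha \
    --use_gemm_plugin bfloat16 \
    --use_lora_plugin bfloat16 \
    --lora_target_modules attn_q attn_k attn_v \
    --output_dir /app/tensorrt_llm/examples/llama/7B/trt_engines/bfp16/2-gpu/ \
    --hf_lora_dir /mnt/hdd1/rajeevy/nlp/LLM/summarization/llama2-16jan/checkpoint-24690 \
    --remove_input_padding \
    --world_size 2 \
    --tp_size 2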