NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Cannot build Nougat model #1088

Open mtenenholtz opened 7 months ago

mtenenholtz commented 7 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

Following the instructions for Nougat here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal#nougat

The error happens during the build.py step.

python ../enc_dec/build.py \
    --model_type bart \
    --weight_dir tmp/trt_models/${MODEL_NAME}/tp1 \
    -o trt_engines/${MODEL_NAME}/1-gpu \
    --engine_name $MODEL_NAME \
    --use_bert_attention_plugin \
    --use_gpt_attention_plugin \
    --use_gemm_plugin \
    --dtype bfloat16 \
    --max_beam_width 1 \
    --max_batch_size 1 \
    --nougat \
    --max_output_len 100 \
    --max_multimodal_len 588
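For completeness, the command above assumes `MODEL_NAME` was exported earlier in the walkthrough. The engine path in the log output (`tmp/trt_models/nougat-small/tp1`) suggests a value like:

```shell
# Assumed setup, inferred from the log output path — adjust to the Nougat
# variant you converted (e.g. nougat-small, nougat-base).
export MODEL_NAME="nougat-small"
```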

Expected behavior

The model builds successfully.

Actual behavior

[02/15/2024-19:53:13] [TRT-LLM] [W] Skipping build of encoder for Nougat model
[02/15/2024-19:53:13] [TRT-LLM] [I] Setting model configuration from tmp/trt_models/nougat-small/tp1.
[02/15/2024-19:53:13] [TRT-LLM] [I] use_bert_attention_plugin set, without specifying a value. Using bfloat16 automatically.
[02/15/2024-19:53:13] [TRT-LLM] [I] use_gpt_attention_plugin set, without specifying a value. Using bfloat16 automatically.
[02/15/2024-19:53:13] [TRT-LLM] [I] use_gemm_plugin set, without specifying a value. Using bfloat16 automatically.
[02/15/2024-19:53:13] [TRT-LLM] [W] Forcing max_encoder_input_len equal to max_prompt_embedding_table_size
[02/15/2024-19:53:13] [TRT-LLM] [I] Serially build TensorRT engines.
[02/15/2024-19:53:13] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 121, GPU 404 (MiB)
[02/15/2024-19:53:14] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1809, GPU +316, now: CPU 2066, GPU 720 (MiB)
[02/15/2024-19:53:14] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[02/15/2024-19:53:14] [TRT-LLM] [I] Loading weights from binary...
[02/15/2024-19:53:14] [TRT-LLM] [I] Weights loaded. Total time: 00:00:00
Traceback (most recent call last):
  File "/home/mark/projects/searchresearch/TensorRT-LLM/examples/multimodal/../enc_dec/build.py", line 574, in <module>
    run_build(component='decoder')
  File "/home/mark/projects/searchresearch/TensorRT-LLM/examples/multimodal/../enc_dec/build.py", line 565, in run_build
    build(0, args)
  File "/home/mark/projects/searchresearch/TensorRT-LLM/examples/multimodal/../enc_dec/build.py", line 509, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/home/mark/projects/searchresearch/TensorRT-LLM/examples/multimodal/../enc_dec/build.py", line 402, in build_rank_engine
    network.plugin_config.to_legacy_setting()
AttributeError: 'PluginConfig' object has no attribute 'to_legacy_setting'

Additional notes

It looks like the to_legacy_setting() method doesn't exist on the PluginConfig class.

symphonylyh commented 6 months ago

Hi @mtenenholtz, did you do a full update to the latest main? I do see that the PluginConfig class has this attribute: https://github.com/NVIDIA/TensorRT-LLM/blob/3c373ebc5b5caf7e41198125131a153f3df08f09/tensorrt_llm/plugin/plugin.py#L135
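One way to check whether the installed package actually exposes the attribute the example scripts expect is an introspection check. This is a generic sketch — `has_method` is a hypothetical helper, demonstrated here against the standard library rather than tensorrt_llm (which may not be importable outside the build environment); the same call pattern applies to `("tensorrt_llm.plugin.plugin", "PluginConfig", "to_legacy_setting")`:

```python
import importlib

def has_method(module_name: str, class_name: str, method_name: str) -> bool:
    """Return True if module_name.class_name defines method_name.

    Useful for diagnosing a version mismatch between a pip-installed
    package and a newer source checkout whose example scripts call
    attributes the installed wheel does not yet have.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    cls = getattr(module, class_name, None)
    return cls is not None and callable(getattr(cls, method_name, None))

# Demonstrated against the standard library:
print(has_method("collections", "OrderedDict", "move_to_end"))  # → True
```

If this returns False for `PluginConfig.to_legacy_setting`, the installed tensorrt_llm wheel is older than the example scripts, and reinstalling from the current main should resolve the AttributeError.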