NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Migrating engine build process from 0.7.1 to 0.10.0 #1840

Closed tapansstardog closed 3 days ago

tapansstardog commented 3 days ago

Hi team,

We have been using 0.7.1 and are now upgrading to 0.10.0. While running the checkpoint-conversion step, I get the error below:

```
convert_checkpoint.py: error: unrecognized arguments: --remove_input_padding --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_rmsnorm_plugin float16 --enable_context_fmha --world_size 8 --use_inflight_batching --max_input_len 4096 --max_output_len 1024 --max_batch_size 8 --paged_kv_cache
```

These flags used to be part of build.py.
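For reference, our 0.7.1 build command looked roughly like this (reconstructed from the flags in the error above; the model and output paths here are placeholders):

```bash
# 0.7.1-style single-step build (paths are placeholders)
python build.py \
    --model_dir ./hf_model \
    --output_dir ./engine \
    --dtype float16 \
    --world_size 8 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --use_rmsnorm_plugin float16 \
    --enable_context_fmha \
    --use_inflight_batching \
    --paged_kv_cache \
    --max_input_len 4096 \
    --max_output_len 1024 \
    --max_batch_size 8
```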

Are these applied by default when building the engine files? What are their default values? I am mainly concerned about max_input_len, max_output_len, max_batch_size, and paged_kv_cache.

Thanks!

tapansstardog commented 3 days ago

Closing this, as I realized trtllm-build now handles that part of the build.
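For anyone else migrating: in 0.10.0 the old single-step build.py is split into a checkpoint-conversion step and an engine-build step. A rough sketch of the equivalent flow is below (paths are placeholders, and the flag spellings reflect my reading of the 0.10 CLIs, so verify against `trtllm-build --help` and your model's example):

```bash
# Step 1: convert the checkpoint. Parallelism moved here:
# --tp_size roughly replaces the old --world_size for tensor parallelism.
python convert_checkpoint.py \
    --model_dir ./hf_model \
    --output_dir ./ckpt \
    --dtype float16 \
    --tp_size 8

# Step 2: build the engine. The old build.py options live here now,
# mostly renamed from --use_X_plugin to --X_plugin and from on/off
# switches to enable/disable values.
trtllm-build \
    --checkpoint_dir ./ckpt \
    --output_dir ./engine \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --remove_input_padding enable \
    --context_fmha enable \
    --paged_kv_cache enable \
    --max_batch_size 8 \
    --max_input_len 4096 \
    --max_output_len 1024

# Notes (my understanding, please verify): there is no separate
# --use_inflight_batching flag any more; in-flight batching is available
# once paged_kv_cache and remove_input_padding are enabled. I did not
# find an rmsnorm plugin option in trtllm-build, so it is omitted here.
```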