Hi team,
We have been using 0.7.1 and are now upgrading to 0.10.0. While running the checkpoint conversion step (`convert_checkpoint.py`), I get the error below:
convert_checkpoint.py: error: unrecognized arguments: --remove_input_padding --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_rmsnorm_plugin float16 --enable_context_fmha --world_size 8 --use_inflight_batching --max_input_len 4096 --max_output_len 1024 --max_batch_size 8 --paged_kv_cache
These used to be part of `build.py`. Are these options applied by default when building engine files? What are their default values? I am mainly concerned about `max_input_len`, `max_output_len`, `max_batch_size`, and `paged_kv_cache`.

Thanks!
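For context, my understanding is that in 0.10.0 these build-time options moved out of `convert_checkpoint.py` and into the `trtllm-build` command. The invocation I am attempting looks roughly like this (flag names, values, and directory paths are my guesses from the 0.10 docs, not verified):

```shell
# Hypothetical migration of the old build.py flags to trtllm-build (0.10.x).
# Flag names below are assumptions based on the 0.10 docs; please correct me.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_output \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --remove_input_padding enable \
    --context_fmha enable \
    --paged_kv_cache enable \
    --max_input_len 4096 \
    --max_output_len 1024 \
    --max_batch_size 8
```

Is this the intended replacement, and do the defaults match what `build.py` used in 0.7.1?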