Could you explain your motivation for setting it to 0 instead of 1 << 30? 1 << 30 should work equivalently to 0, and it is friendlier to our kernel implementation.
System Info
GPU: A100; TensorRT-LLM version: tensorrt_llm-0.13.0.dev2024090300; OS: Ubuntu.
Who can help?
Hi @ncomly-nvidia, @byshiue,
I want to set no_repeat_ngram_size=0 for a Mistral model, but I get the following assertion error:
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: noRepeatNgramSize.value() > 0 (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/executor/samplingConfig.cpp:332)
As per the documentation, the default value is 1 << 30. Is there a way to set the value to 0? If not, can this feature be added?
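For context on why 0 and 1 << 30 behave the same, here is a minimal sketch of no-repeat-ngram banning (this is an illustrative Python model of the constraint, not TensorRT-LLM's kernel): a size of 0 conventionally means "disabled", and a huge size like 1 << 30 is also effectively disabled, because a finite sequence contains no n-gram that long.

```python
def banned_next_tokens(tokens, no_repeat_ngram_size):
    """Return the set of next tokens that would complete a repeated n-gram.

    Sketch of the usual no-repeat-ngram convention: size 0 disables the
    constraint, and any size longer than the sequence can never trigger,
    so 1 << 30 is equivalent to 0 in practice.
    """
    n = no_repeat_ngram_size
    if n <= 0 or len(tokens) + 1 < n:
        return set()  # constraint disabled, or sequence too short to repeat
    # The last n-1 generated tokens form the prefix of a would-be n-gram.
    prefix = tuple(tokens[-(n - 1):]) if n > 1 else tuple()
    banned = set()
    # Scan every n-gram already present; ban its final token if the
    # preceding n-1 tokens match the current prefix.
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned
```

With tokens [1, 2, 3, 1, 2] and size 3, token 3 is banned (it would repeat the trigram 1-2-3), while sizes 0 and 1 << 30 both ban nothing.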
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Set no_repeat_ngram_size=0 under SamplingParams for a Mistral model.
Expected behavior
The user should be allowed to set this value to 0.
actual behavior
The assertion error shown above is raised.
additional notes
We want to set it to 0, as we do with PyTorch eager-mode inference, where 0 disables the no-repeat-ngram constraint.
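Until 0 is accepted by the backend, one user-side workaround (a hypothetical helper sketched under the assumption, per the maintainer comment above, that the backend accepts 1 << 30 as "effectively disabled") is to translate the familiar 0-means-disabled convention before passing the value through:

```python
# Sentinel matching the documented default; assumed to be accepted by the
# backend, whose assertion only requires noRepeatNgramSize > 0.
DISABLED_NGRAM_SIZE = 1 << 30

def normalize_no_repeat_ngram_size(size: int) -> int:
    """Hypothetical adapter: map 0 ("disabled") to the positive sentinel.

    This lets callers keep the 0-means-disabled convention while
    satisfying a backend that rejects non-positive sizes.
    """
    if size < 0:
        raise ValueError("no_repeat_ngram_size must be >= 0")
    return DISABLED_NGRAM_SIZE if size == 0 else size
```

The result would then be passed as the no_repeat_ngram_size value in SamplingParams instead of the raw 0.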