Hello, I have a couple of determinism-related questions. Thank you very much in advance for your insight. At a high level, my goal is to get the highest-throughput deterministic model generations that I can, understanding that I will be reducing throughput in order to achieve determinism.
I understand that TensorRT-LLM engines will select different kernels based on batch size, thus we expect a given prompt to produce different generation logits when run at different batch sizes (e.g. [text] vs [text, text, text]).
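To make this concrete, here is the kind of check I have been using to observe it. generate_logits is a hypothetical stand-in for my actual engine invocation (it is not a TensorRT-LLM API); it returns the generation logits for the first sequence in the batch:

```python
import numpy as np

def generate_logits(prompts: list[str]) -> np.ndarray:
    """Hypothetical stand-in for my engine invocation: run the engine on
    `prompts` and return the generation logits for the first sequence."""
    ...

logits_bs1 = generate_logits(["text"])
logits_bs3 = generate_logits(["text", "text", "text"])

# Different batch sizes can select different kernels, so I do not expect
# bitwise-identical logits here even though the prompt is the same.
print("bitwise identical:", np.array_equal(logits_bs1, logits_bs3))
print("max abs diff:", float(np.max(np.abs(logits_bs1 - logits_bs3))))
```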
I have been able to get deterministic generation by using the inflight batcher with max_batch_size=1; however, this is highly inefficient since many of my generated sequences are long.
I was considering static batching; however, it seems this will not be supported in the future given the deprecation of the V1 batcher. With that in mind, what is currently the "best" way to get high-throughput deterministic generations?
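For reference, this is roughly what I mean by static batching: pad the request stream into fixed-size batches so the engine always sees the same batch dimension (and, I assume, the same kernel selection). generate_batch below is a hypothetical stand-in for the engine call, not a real API:

```python
from typing import Iterable, Iterator

def fixed_size_batches(prompts: Iterable[str], batch_size: int,
                       pad_prompt: str = "") -> Iterator[list[str]]:
    """Group prompts into batches of exactly `batch_size`, padding the
    final batch with dummy prompts so the batch dimension never varies."""
    batch: list[str] = []
    for prompt in prompts:
        batch.append(prompt)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # pad the trailing partial batch to the fixed size
        batch += [pad_prompt] * (batch_size - len(batch))
        yield batch

# generate_batch(batch) is a hypothetical stand-in for the engine call;
# outputs produced for the pad prompts would simply be discarded.
# for batch in fixed_size_batches(all_prompts, batch_size=8):
#     outputs = generate_batch(batch)
```

The obvious downside is the same one I hit with max_batch_size=1: a batch is only as fast as its longest sequence, so throughput suffers when sequence lengths vary a lot.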
I understand that TensorRT engine builds can be non-deterministic due to system noise while the builder profiles different kernels. I have not yet done rigorous testing, but I am getting reproducible results on an 8xH100 server that is simultaneously running other workloads. Are there steps taken in TensorRT-LLM to ensure a deterministic engine build, or am I simply getting reproducible results by luck?
Are there any flags I can set during the build stage that will encourage reproducible engine builds, even if a purely deterministic outcome cannot be guaranteed?
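One thing I came across that may be relevant here (please correct me if I am misreading it): plain TensorRT exposes a timing cache that can be serialized and reloaded, so that later builds read kernel tactics from the cache instead of re-profiling them. Below is a minimal sketch using the TensorRT Python API; the cache file path is my own choice, and I am not sure whether or how TensorRT-LLM's build flow exposes this:

```python
import os
import tensorrt as trt

TIMING_CACHE_PATH = "model.timing.cache"  # arbitrary path of my choosing

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Load a previously serialized timing cache if one exists; otherwise
# start from an empty one. Tactics already recorded in the cache are
# reused rather than re-profiled, which should at least encourage
# reproducible builds after the cache has been populated once.
data = b""
if os.path.exists(TIMING_CACHE_PATH):
    with open(TIMING_CACHE_PATH, "rb") as f:
        data = f.read()
cache = config.create_timing_cache(data)
config.set_timing_cache(cache, ignore_mismatch=False)

# ... define the network and build the engine with this config ...

# Persist the (possibly updated) cache for the next build.
with open(TIMING_CACHE_PATH, "wb") as f:
    f.write(bytes(cache.serialize()))
```

My understanding is that this does not make the very first build deterministic (that build still profiles under system noise), but subsequent builds from the same cache should pick the same tactics.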