NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Feature Request] Optional tracing-based (ONNX etc.) solution inside TRT-LLM #2519

Open tp-nan opened 1 day ago

tp-nan commented 1 day ago

Currently, model definitions in TRT-LLM are mainly built manually through TensorRT's API or plugins. While this provides flexibility, an optional tracing-based (mainly ONNX) solution could make the library accessible to users less familiar with TRT-LLM internals, and could support non-generic model structures.

To illustrate, take Llama 2 as an example: the model can be roughly divided into two parts, batchful and batchless. The model parameters live almost entirely in the batchful part, whereas the batchless part consists of positional encoding and parameter-free attention. By treating attention as an independent custom ONNX op, it is possible to export a standalone, batchable ONNX graph. The attention node can then be implemented as a TensorRT plugin that essentially forwards the batchless part to a dedicated server. Scheduling (continuous batching / paged attention / vAttention) can then be performed independently of the TensorRT system, as sketched below.
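A minimal sketch of the export side, under stated assumptions: the custom domain `trtllm::Attention`, the `AttentionOp` wrapper, and the toy `Block` module are all hypothetical illustrations, not the prototype's actual code. The attention call is emitted as a single opaque ONNX node, while the surrounding linear layers (the batchful part) trace normally:

```python
# Hypothetical sketch: export a Llama-style block to ONNX while keeping
# attention as one opaque custom node ("trtllm::Attention" is a made-up name).
import torch

class AttentionOp(torch.autograd.Function):
    """Attention placeholder that exports as a single custom ONNX node."""

    @staticmethod
    def forward(ctx, q, k, v):
        # Eager fallback so tracing still sees tensors of the right shape.
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)

    @staticmethod
    def symbolic(g, q, k, v):
        # Emit one opaque node; a TensorRT plugin would implement it at runtime.
        return g.op("trtllm::Attention", q, k, v)

class Block(torch.nn.Module):
    def __init__(self, d=4096):
        super().__init__()
        self.qkv = torch.nn.Linear(d, 3 * d, bias=False)  # batchful: holds weights
        self.out = torch.nn.Linear(d, d, bias=False)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        y = AttentionOp.apply(q, k, v)                     # batchless: opaque node
        return self.out(y)

torch.onnx.export(
    Block(), torch.randn(1, 8, 4096), "block.onnx",
    input_names=["x"], output_names=["y"],
    dynamic_axes={"x": {0: "batch", 1: "seq"}},
    custom_opsets={"trtllm": 1},
)
```

At engine-build time, the `trtllm::Attention` node would be matched to a TensorRT plugin of the same name, which would route the batchless computation to the dedicated scheduler described above.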

A very early prototype can be found here, but completing it involves a substantial amount of work that is beyond our capacity; we believe it might be worth consideration by the TRT-LLM community.

Is such an idea feasible?

tp-nan commented 1 day ago

PS. Preliminary benchmarks indicate that performance comparable to vLLM can be achieved at low sequence lengths, even with pure TensorRT and a configuration that is not well optimized:

| NVIDIA A10, llama2-7b, fp16 | vLLM (mean, ms) | TRT + schedule (mean, ms) |
| --- | --- | --- |
| qps=2, num_samples=50 | TTFT 107, TPOT 44 | TTFT 121, TPOT 48 |
| qps=2, num_samples=500 | - | OOM |

achieved by utilizing a) continuous batching, b) stream-ordered memory reuse, c) a batchful model employing 5 IOptimizationProfiles with shared memory and non-overlapping ranges, and d) four parallel batchless IOptimizationProfile instances. A sketch of the multi-profile setup follows.
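For context, a minimal sketch of point c) using the TensorRT Python API, under stated assumptions: an identity layer stands in for the real batchful model, the input name `"x"` and the batch ranges are illustrative, and only the multi-profile mechanics are shown:

```python
# Minimal sketch: build one engine with five optimization profiles over
# non-overlapping batch ranges, so the runtime can pick the tightest fit.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

seq, hidden = 1, 4096
x = network.add_input("x", trt.float32, (-1, seq, hidden))  # dynamic batch
identity = network.add_identity(x)   # placeholder for the real batchful model
network.mark_output(identity.get_output(0))

config = builder.create_builder_config()
# Illustrative non-overlapping batch ranges, one profile each.
for lo, hi in [(1, 4), (5, 8), (9, 16), (17, 32), (33, 64)]:
    profile = builder.create_optimization_profile()
    profile.set_shape("x", min=(lo, seq, hidden),
                      opt=(hi, seq, hidden), max=(hi, seq, hidden))
    config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
```

At runtime, each execution context would bind one profile via `IExecutionContext.set_optimization_profile_async`, which is presumably how the four parallel batchless instances in point d) each get their own profile.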