PS. Preliminary benchmarks indicate that performance comparable to vllm can be achieved at low sequence-length complexity, even with pure TensorRT and a configuration that is not well optimized:
NVIDIA A10 / llama2-7b fp16:

| Configuration | vllm (mean, ms) | trt+schedule (mean, ms) |
|---|---|---|
| qps=2, num_samples=50 | TTFT 107, TPOT 44 | TTFT 121, TPOT 48 |
| qps=2, num_samples=500 | - | OOM |
These results were obtained by utilizing a) contiguous batching, b) stream-ordered memory reuse, c) a batchful model employing 5 IOptimizationProfiles with shared memory and non-overlapping ranges, and d) four parallel batchless IOptimizationProfile instances. A sketch of the multi-profile setup is shown below.
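For reference, here is a minimal sketch of what non-overlapping optimization profiles could look like with the TensorRT Python API; the input name, shape bounds, and batch ranges are illustrative assumptions, not taken from the prototype:

```python
import tensorrt as trt

# Build-time sketch: five optimization profiles over disjoint batch
# ranges, so the runtime can select the profile that matches the batch
# size chosen by the scheduler.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
# ... populate `network`, e.g. via an ONNX parser ...
config = builder.create_builder_config()

# (min, opt, max) batch sizes per profile; the ranges do not overlap.
batch_ranges = [(1, 1, 4), (5, 8, 8), (9, 16, 16), (17, 32, 32), (33, 48, 64)]
for lo, opt, hi in batch_ranges:
    profile = builder.create_optimization_profile()
    # "input_ids" with shape (batch, seq_len) is an assumed input name.
    profile.set_shape("input_ids", (lo, 1), (opt, 512), (hi, 2048))
    config.add_optimization_profile(profile)

# At runtime, each execution context would pick the matching profile via
# context.set_optimization_profile_async(profile_index, stream_handle).
```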
Currently, the model definition in trt-llm is mainly built by hand through TensorRT's API or plugins. While this provides flexibility, an optional tracing-based (mainly ONNX) solution could enable support for users who are less familiar with the trt-llm structure, or for non-generic model structures.
To illustrate, taking llama2 as an example, the model can be roughly divided into two parts: batchful and batchless. Model parameters are mainly located in the batchful part, whereas the batchless part consists of positional encoding and parameter-free attention. By treating attention as an opaque ONNX custom op, it is possible to export a standalone, batchable ONNX model. The attention node can then be implemented as a TensorRT plugin; this plugin essentially directs the batchless part to a dedicated server. Scheduling (contiguous batching / paged attention / vAttention) can subsequently be performed independently of the TensorRT system, as in the export sketch below.
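A minimal sketch of what the export side might look like in PyTorch, assuming attention is wrapped as an opaque custom op (the `trt_plugin::Attention` domain/op name and the wrapper are hypothetical):

```python
import torch
import torch.nn.functional as F

class OpaqueAttention(torch.autograd.Function):
    """Stand-in for the batchless attention: exported as a single custom
    ONNX node that a matching TensorRT plugin would implement at runtime
    by handing q/k/v off to the external attention/scheduling server."""

    @staticmethod
    def forward(ctx, q, k, v):
        # Eager fallback so tracing still produces tensors of the right
        # shape; the exported graph keeps only the custom node below.
        return F.scaled_dot_product_attention(q, k, v)

    @staticmethod
    def symbolic(g, q, k, v):
        # "trt_plugin::Attention" is an assumed custom domain/op name.
        return g.op("trt_plugin::Attention", q, k, v)

def attention(q, k, v):
    return OpaqueAttention.apply(q, k, v)

# Export sketch (model wiring omitted); the custom domain version must
# be declared so the exporter accepts the op:
# torch.onnx.export(model, example_inputs, "batchful.onnx",
#                   custom_opsets={"trt_plugin": 1})
```

The resulting ONNX graph would then contain the batchful weights plus opaque attention nodes, which the TensorRT plugin can intercept to route q/k/v through the external scheduler.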
A very early prototype can be found here, but it involves a substantial amount of work that is beyond our capacity; we believe it might be worth consideration by the TRT-LLM community.
Is such an idea feasible?