PS. Preliminary benchmarks indicate that performance comparable to vllm can be achieved at low sequence-length complexity, even with pure TensorRT and a configuration that is not well optimized:
NVIDIA A10 / llama2-7b fp16:

| Configuration | vllm (mean, ms) | trt+schedule (mean, ms) |
|---|---|---|
| qps=2, num_samples=50 | TTFT 107, TPOT 44 | TTFT 121, TPOT 48 |
| qps=2, num_samples=500 | - | OOM |
These results were obtained by utilizing a) contiguous batching, b) stream-ordered memory reuse, c) a batchful model employing 5 IOptimizationProfiles with shared memory and non-overlapping ranges, and d) four parallel batchless IOptimizationProfile instances. A sketch of the multi-profile setup is shown below.
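For reference, here is a minimal sketch of what non-overlapping optimization profiles could look like with the TensorRT Python API; the input name, shape bounds, and batch ranges are illustrative assumptions, not taken from the prototype:

```python
import tensorrt as trt

# Build-time sketch: five optimization profiles over disjoint batch
# ranges, so the runtime can select the profile that matches the batch
# size chosen by the scheduler.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
# ... populate `network`, e.g. via an ONNX parser ...
config = builder.create_builder_config()

# (min, opt, max) batch sizes per profile; the ranges do not overlap.
batch_ranges = [(1, 1, 4), (5, 8, 8), (9, 16, 16), (17, 32, 32), (33, 48, 64)]
for lo, opt, hi in batch_ranges:
    profile = builder.create_optimization_profile()
    # "input_ids" with shape (batch, seq_len) is an assumed input name.
    profile.set_shape("input_ids", (lo, 1), (opt, 512), (hi, 2048))
    config.add_optimization_profile(profile)

# At runtime, each execution context would pick the matching profile via
# context.set_optimization_profile_async(profile_index, stream_handle).
```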
Currently, the model definition in trt-llm is mainly built by hand through TensorRT's API or plugins. While this provides flexibility, an optional tracing-based (mainly ONNX) solution could enable support for users who are less familiar with the trt-llm structure, or for non-generic model structures.
To illustrate, taking llama2 as an example, the model can be roughly divided into two parts: batchful and batchless. Model parameters are mainly located in the batchful part, whereas the batchless part consists of positional encoding and parameter-free attention. By treating attention as an opaque ONNX custom op, it is possible to export a standalone, batchable ONNX model. The attention node can then be implemented as a TensorRT plugin; this plugin essentially directs the batchless part to a dedicated server. Scheduling (contiguous batching / paged attention / vAttention) can subsequently be performed independently of the TensorRT system, as in the export sketch below.
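A minimal sketch of what the export side might look like in PyTorch, assuming attention is wrapped as an opaque custom op (the `trt_plugin::Attention` domain/op name and the wrapper are hypothetical):

```python
import torch
import torch.nn.functional as F

class OpaqueAttention(torch.autograd.Function):
    """Stand-in for the batchless attention: exported as a single custom
    ONNX node that a matching TensorRT plugin would implement at runtime
    by handing q/k/v off to the external attention/scheduling server."""

    @staticmethod
    def forward(ctx, q, k, v):
        # Eager fallback so tracing still produces tensors of the right
        # shape; the exported graph keeps only the custom node below.
        return F.scaled_dot_product_attention(q, k, v)

    @staticmethod
    def symbolic(g, q, k, v):
        # "trt_plugin::Attention" is an assumed custom domain/op name.
        return g.op("trt_plugin::Attention", q, k, v)

def attention(q, k, v):
    return OpaqueAttention.apply(q, k, v)

# Export sketch (model wiring omitted); the custom domain version must
# be declared so the exporter accepts the op:
# torch.onnx.export(model, example_inputs, "batchful.onnx",
#                   custom_opsets={"trt_plugin": 1})
```

The resulting ONNX graph would then contain the batchful weights plus opaque attention nodes, which the TensorRT plugin can intercept to route q/k/v through the external scheduler.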
A very early prototype can be found here, but it involves a substantial amount of work that is beyond our capacity; we believe it might be worth consideration by the TRT-LLM community.
Is such an idea feasible?