NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Demo notebook is almost completely useless: "Accelerating HuggingFace T5 Inference with TensorRT" #1642

Closed Oxi84 closed 1 year ago

Oxi84 commented 2 years ago

I went through the notebook (https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/notebooks/t5.ipynb), Accelerating HuggingFace T5 Inference with TensorRT, and I found it almost completely useless.

It does work faster with TensorRT, around 2x faster on an RTX 2080 at batch size 1, but there is no example using beam search and no example with a larger batch size.

When I add a second item, so the batch size is 2, it does not work. That makes it much slower than the non-TensorRT version, which can use larger batch sizes and beam search to get higher-quality results.
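
For reference, the non-TensorRT path handles both of these out of the box. A minimal baseline sketch with plain HuggingFace, for comparison; the model name and generation parameters are illustrative, not taken from the notebook:

```python
# Baseline (non-TensorRT) T5 generation with batch size > 1 and beam search.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval().cuda()

texts = [
    "translate English to German: The house is wonderful.",
    "translate English to German: The weather is nice today.",
]
inputs = tokenizer(texts, return_tensors="pt", padding=True).to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        num_beams=4,     # beam search, which the demo notebook does not cover
        max_length=64,
    )
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```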

victox5 commented 2 years ago

I would not say useless, as it has shown value in optimizing T5, but it is true that the limitation to batch_size=1 does not allow a production-ready implementation. In my case, for a single sample it gives a 5x improvement, so I see potential if we can make it work for batches.

I've been trying to modify the batch_size value, but the results did not make sense for any sample in the batch except the first one. Then I saw that the inputs are concatenated and flattened, so I tried to keep the full shapes. No success so far; I was a bit confused by the binding part of self.trt_context.

It would be great if someone could show how to run batch inference with the TRT model!
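
Not the demo's code, but here is a rough sketch of what batch inference looks like with the TensorRT 8.x Python API when the engine is built with a dynamic batch axis. The tensor names, shape ranges, output width, and the encoder ONNX path are assumptions about the export, not values from the notebook:

```python
# Sketch: build an engine from an exported T5 encoder ONNX file with a dynamic
# batch dimension, then set the binding shape per request before execution.
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("t5_encoder.onnx", "rb") as f:          # hypothetical export path
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# (batch, seq_len): the min/opt/max ranges must cover every shape you plan to run.
profile.set_shape("input_ids", (1, 1), (4, 64), (8, 256))
config.add_optimization_profile(profile)
engine = trt.Runtime(logger).deserialize_cuda_engine(
    builder.build_serialized_network(network, config)
)

context = engine.create_execution_context()
batch, seq_len = 2, 64
context.set_binding_shape(engine.get_binding_index("input_ids"), (batch, seq_len))

input_ids = np.zeros((batch, seq_len), dtype=np.int32)       # int32: assuming the export casts ids
output = np.empty((batch, seq_len, 512), dtype=np.float32)   # 512 = d_model of t5-small (assumed)
d_input = cuda.mem_alloc(input_ids.nbytes)
d_output = cuda.mem_alloc(output.nbytes)
cuda.memcpy_htod(d_input, input_ids)
context.execute_v2([int(d_input), int(d_output)])            # bindings in engine binding order
cuda.memcpy_dtoh(output, d_output)
```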

pommedeterresautee commented 2 years ago

In my experience with encoder-only transformer architectures, TensorRT works less well when you have two dynamic axes (one is OK; all axes fixed gives the best performance). Obviously, on generative models (decoder-only, or enc+dec) the sequence-length axis can't be fixed. On GPT-2, if I keep the batch size fixed at any value (1 or more), TRT is super fast (around 50% faster than ONNX Runtime with the io_binding API). I imagine it will be the same for T5. Making batch inference work with TRT required rewriting the TRT demo code from this repo. I will push something there in the coming days if you are interested: https://github.com/ELS-RD/transformer-deploy
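
A sketch of that export strategy (fixed batch, only the sequence axis dynamic), shown here for GPT-2; the input/output names, opset, and batch size are my own choices, not the transformer-deploy code:

```python
# Export GPT-2 to ONNX with a fixed batch dimension and a dynamic sequence axis,
# so that TensorRT only has to handle one dynamic axis.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
model.config.use_cache = False                  # export without past key/values

batch_size = 4                                  # fixed at export time
dummy_input = torch.ones((batch_size, 8), dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_input,),
    "gpt2.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {1: "seq_len"},            # only axis 1 (sequence) is dynamic
        "logits": {1: "seq_len"},
    },
    opset_version=13,
)
```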

vblagoje commented 2 years ago

@pommedeterresautee Michaël, I'd love to follow up on these changes and see how they can be transferred to BART. Speeding up the GenerationMixin.generate method for BART (and other enc+dec architectures) would be absolutely amazing.

pommedeterresautee commented 2 years ago

@vblagoje have a look here: https://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/gpt2.ipynb

Basically, it should be the same for T5 and BART.

The good news: even in mixed precision, the generated sequence is the same as the PyTorch one after 256 tokens (at least on the NVIDIA prompt).

Right now I just need to write the Triton configuration part, but everything else is OK.
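
For anyone who wants to reproduce that parity check, a minimal greedy-decoding loop is enough to compare the outputs token by token. `trt_decoder_logits` below is a hypothetical callable wrapping the TensorRT engine, standing in for the transformer-deploy wrapper:

```python
# Greedy decoding with a pluggable logits function, to compare a TensorRT-backed
# decoder against the PyTorch baseline token by token.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
baseline = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def greedy_generate(logits_fn, input_ids, max_new_tokens=256):
    for _ in range(max_new_tokens):
        logits = logits_fn(input_ids)                  # (batch, seq, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids

prompt = tokenizer("Hello, my name is", return_tensors="pt").input_ids

with torch.no_grad():
    ref = greedy_generate(lambda ids: baseline(ids).logits, prompt)
    # acc = greedy_generate(trt_decoder_logits, prompt)   # hypothetical TensorRT-backed call
    # assert torch.equal(ref, acc)                        # the parity claimed in the comment above
print(tokenizer.decode(ref[0]))
```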

ttyio commented 1 year ago

This discussion is helpful; marking it as a good reference and closing since it has been inactive for a long time. Thanks all!