I would not say useless, as it has shown value in optimizing T5, but it is true that the limitation to batch_size=1 prevents a production-ready implementation. In my case it gives a 5x speedup for a single sample, so I see potential if we can make it work for batches.
I've been trying to change the batch_size value, but the results only made sense for the first sample in the batch. Then I saw that the inputs are concatenated and flattened, so I tried to keep the full shapes, without success so far; I was a bit confused by the binding part of self.trt_context.
It would be great if someone could show how to run batch inference with the TRT model!
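For anyone hitting the same binding confusion, here is a minimal sketch (my own, not from the notebook) of what a batched forward pass can look like with the TensorRT Python API. The binding names `input_ids` / `logits` and the output dtype are assumptions and have to match your engine:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt


def run_batched(engine: trt.ICudaEngine, input_ids: np.ndarray) -> np.ndarray:
    """Run one forward pass for a whole batch of int32 input_ids."""
    context = engine.create_execution_context()

    # Binding names are assumptions; adjust to whatever your engine exposes.
    idx_in = engine.get_binding_index("input_ids")
    idx_out = engine.get_binding_index("logits")

    # Tell TRT the actual (batch, seq_len) for this call; the engine must have
    # been built with an optimization profile that covers these dimensions.
    context.set_binding_shape(idx_in, input_ids.shape)
    out_shape = tuple(context.get_binding_shape(idx_out))
    output = np.empty(out_shape, dtype=np.float32)  # adjust dtype to the engine

    # Allocate device buffers and copy the batch to the GPU.
    d_in = cuda.mem_alloc(input_ids.nbytes)
    d_out = cuda.mem_alloc(output.nbytes)
    cuda.memcpy_htod(d_in, np.ascontiguousarray(input_ids, dtype=np.int32))

    # Bindings are ordered by binding index, one device pointer per binding.
    bindings = [0] * engine.num_bindings
    bindings[idx_in] = int(d_in)
    bindings[idx_out] = int(d_out)
    context.execute_v2(bindings)

    cuda.memcpy_dtoh(output, d_out)
    return output
```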
In my experience with encoder-only transformer architectures, TensorRT works less well when you have two dynamic axes (one is OK, all fixed gives the best performance). Obviously on generative models (decoder-only, or enc+dec) the seq_len axis can't be fixed. On GPT-2, if I keep the batch size fixed at any value (1 or more), TRT is super fast (around 50% faster than ONNX Runtime with the IOBinding API). I imagine it will be the same for T5. Making batch inference work with TRT required rewriting the TRT demo code from this repo. I will push something there in the coming days if you are interested: https://github.com/ELS-RD/transformer-deploy
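To make the single-dynamic-axis point concrete, here is a rough sketch (not the repo's code) of how such an engine could be built with the TensorRT Python API, with the batch dimension pinned and only seq_len dynamic. The ONNX input name `input_ids` and the shape ranges are assumptions:

```python
import tensorrt as trt


def build_engine(onnx_path: str, batch_size: int = 4, max_seq_len: int = 256):
    """Build a serialized TRT engine with a fixed batch and dynamic seq_len."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )

    # Parse the exported ONNX graph.
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # mixed precision, as in the demo

    # Only one dynamic axis: batch is pinned, seq_len ranges from 1 to max_seq_len.
    profile = builder.create_optimization_profile()
    profile.set_shape(
        "input_ids",                        # assumed ONNX input name
        min=(batch_size, 1),
        opt=(batch_size, max_seq_len // 2),
        max=(batch_size, max_seq_len),
    )
    config.add_optimization_profile(profile)

    # Returns serialized engine bytes (IHostMemory) to write to disk.
    return builder.build_serialized_network(network, config)
```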
@pommedeterresautee Michaël I'd love to follow up on these changes and see how they can be transferred to BART. Speeding up the GenerationMixin.generate method for BART (and other enc+dec archs) would be absolutely amazing.
@vblagoje have a look here: https://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/gpt2.ipynb

Basically, it should be the same for T5 and BART.
The good news: even in mixed precision, the generated sequence is the same as PyTorch's after 256 tokens (at least on the NVIDIA prompt).
Right now I just need to write the Triton configuration, but the other parts are OK.
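As a rough illustration of that parity check (my own sketch, not the notebook's code): run a greedy loop once against plain PyTorch and once against the TRT engine, then compare the token ids. `trt_forward` is a hypothetical wrapper around the batched TRT call and is not shown here:

```python
import numpy as np
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast


def greedy_generate(forward_fn, input_ids: np.ndarray, steps: int = 256) -> np.ndarray:
    """Greedy decoding: append the argmax token `steps` times."""
    for _ in range(steps):
        logits = forward_fn(input_ids)              # (batch, seq_len, vocab)
        next_tokens = logits[:, -1, :].argmax(-1)   # (batch,)
        input_ids = np.concatenate([input_ids, next_tokens[:, None]], axis=1)
    return input_ids


tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("My name is", return_tensors="np")["input_ids"]


def torch_forward(ids: np.ndarray) -> np.ndarray:
    # Reference path in plain PyTorch.
    with torch.no_grad():
        return model(torch.from_numpy(ids)).logits.numpy()


ref = greedy_generate(torch_forward, prompt_ids)
# out = greedy_generate(trt_forward, prompt_ids)   # TRT path, wrapper not shown
# assert np.array_equal(ref, out)                  # same sequence expected
```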
This discussion is helpful, marking it as a good reference, and closing since it has not been active for a long time. Thanks all!
I went through the notebook (https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/notebooks/t5.ipynb), Accelerating HuggingFace T5 Inference with TensorRT, and I found it almost completely useless.

I mean, it does work faster with TensorRT, around 2x faster on an RTX 2080 for batch size 1, but there is no example using beam search and no example with a larger batch size.

When I add a second item, so the batch size is 2, it does not work. So it is much slower than the non-TensorRT version, which can use larger batch sizes and beam search to get higher-quality results.
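For reference, the non-TensorRT path I am comparing against is just the standard HuggingFace API, which handles larger batches and beam search out of the box; the model name and example texts below are placeholders:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

# A batch of two inputs, padded to the same length.
texts = [
    "translate English to German: The house is wonderful.",
    "translate English to German: I like this notebook a lot.",
]
batch = tokenizer(texts, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        num_beams=4,      # beam search, which the TRT demo does not cover
        max_length=64,
    )

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```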