ashissamal opened this issue 2 years ago
For GPU you can use the onnxruntime-gpu library, but it does not support quantization, so you won't have the advantage of reduced model size during inference.
Here's an example implementation of this library for BERT; you can follow this guide and make suitable changes for T5. In addition to this you also need to implement IOBinding.
I tried without IOBinding but wasn't able to get any advantage over PyTorch.
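As a rough sketch of the onnxruntime-gpu path described above (the model path, input names, and the `preferred_providers` helper are illustrative assumptions, not part of the library's API):

```python
# Sketch: run an exported T5 encoder with onnxruntime-gpu.
# "t5-encoder.onnx" and the input names are assumptions for illustration.

def preferred_providers(available):
    """Prefer the CUDA provider when onnxruntime-gpu exposes it."""
    order = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    picked = [p for p in order if p in available]
    return picked or ["CPUExecutionProvider"]

def load_session(model_path):
    # Imported here so the helper above stays usable without a GPU install.
    import onnxruntime as ort  # requires the onnxruntime-gpu package
    return ort.InferenceSession(
        model_path,
        providers=preferred_providers(ort.get_available_providers()),
    )

# Usage (assumes the model was exported beforehand):
# sess = load_session("t5-encoder.onnx")
# out = sess.run(None, {"input_ids": ids, "attention_mask": mask})
```

Without IOBinding on top of this, the per-call host/device copies can eat the gains, which matches the experience reported above.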
I would also check out this recent demo that NVIDIA did of TensorRT, which involves converting to ONNX as an intermediate step. They run the tests on GPU: https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/README.md
ONNX Runtime supports GPU quantization through the TensorRT provider (now embedded by default in the GPU version of the PyPI package, no need for a custom build). However, it only supports PTQ, meaning there is a 2-3 point accuracy cost (vs. QAT or dynamic quantization, which are usually close to non-quantized accuracy). Quantization brings a 2x speedup, on top of a 1.3x speedup when you switch from ORT to TRT, so quite significant on base/large models (not yet benchmarked on distilled models).
QAT is also doable but requires some work per model (modifying the attention part to add QDQ nodes). You can see some examples here: https://github.com/ELS-RD/transformer-deploy/pull/29; for now only for Albert, Electra, Bert, Roberta and Distilbert. I will probably add support for Deberta V1 and V2, T5 and Bart as a next step.
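To make the PTQ path mentioned above concrete, here is a minimal sketch using ONNX Runtime's `quantize_static` to produce a QDQ-format model. The model paths, the reader class, and the shape of the calibration batches are assumptions for illustration:

```python
# Sketch: post-training static quantization (PTQ) of an exported ONNX model,
# emitting QDQ nodes. Paths and batch layout below are illustrative.

class DummyCalibrationReader:
    """Minimal calibration reader: yields a fixed list of input-feed dicts,
    then None when exhausted (the contract quantize_static expects)."""

    def __init__(self, batches):
        self._it = iter(batches)

    def get_next(self):
        return next(self._it, None)

def quantize_encoder(fp32_path, int8_path, calibration_batches):
    # Imported lazily so the reader above stays usable without onnxruntime.
    from onnxruntime.quantization import quantize_static, QuantFormat
    quantize_static(
        fp32_path,                # e.g. an exported "t5-encoder.onnx"
        int8_path,                # destination for the quantized model
        calibration_data_reader=DummyCalibrationReader(calibration_batches),
        quant_format=QuantFormat.QDQ,  # QDQ nodes, TensorRT-friendly
    )
```

This is the PTQ flavor: calibration on sample batches fixes the ranges, which is where the 2-3 point accuracy cost relative to QAT comes from.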
I really appreciate the functionality that the fastT5 library offers!
Like the original poster, I am looking to leverage the speedup from both ONNX Runtime and quantization that fastT5 offers, and deploy this on an NVIDIA GPU. Do you have any pointers on how to accomplish this with a t5-large model?
@Ki6an or @sam-writer Are there plans to add GPU support for this library?
Thanks!
~@nbravulapalli yes, there are plans to make running on GPU as easy as running on CPU is currently. However, if you need to run on GPU now, your best bet is probably to follow this notebook to convert the model to TensorRT format, which runs on GPU faster than quantized ONNX t5 runs on CPU.~
I now understand @pommedeterresautee's comment. You do not need to convert to TRT format to use TRT: you can convert to ONNX format, then, per the ONNX Runtime docs, use the TRT execution provider
```python
import onnxruntime as ort

# Set providers to ['TensorrtExecutionProvider', 'CUDAExecutionProvider'],
# with TensorrtExecutionProvider having the higher priority.
sess = ort.InferenceSession(
    'model.onnx',
    providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider'],
)
```
This might be fast enough, because TRT gives a 1.3x speed boost. But if you want the 2x speed boost of quantization, you take an accuracy hit. In fastT5, quantization doesn't hurt accuracy because we use dynamic quantization, which AFAICT isn't an option on GPU yet; you'd be using PTQ instead, which does hurt accuracy.
Another consideration on GPU that isn't a factor on CPU is IOBinding: copying values back and forth to the GPU takes time and should be minimized. Not getting IOBinding right can cause a perf hit.
Here is an example from a branch of the ONNX library that demonstrates using IOBinding as well as other tricks needed to run on GPU: link
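A minimal IOBinding sketch along those lines, keeping tensors on the device so they aren't copied on every call. The input/output names (`input_ids`, `attention_mask`, `last_hidden_state`) are assumed for illustration, and running it requires onnxruntime-gpu on a CUDA machine:

```python
# Sketch: bind inputs/outputs to GPU memory so ORT doesn't shuttle
# tensors between host and device on every run. Names are assumptions.

def run_with_iobinding(sess, input_ids, attention_mask):
    # Imported here so the module stays importable without onnxruntime.
    import onnxruntime as ort

    binding = sess.io_binding()
    for name, arr in (("input_ids", input_ids),
                      ("attention_mask", attention_mask)):
        # Upload the numpy array to GPU memory once, bind the device buffer.
        ov = ort.OrtValue.ortvalue_from_numpy(arr, "cuda", 0)
        binding.bind_ortvalue_input(name, ov)

    # Let ORT allocate the output on the GPU instead of copying to host.
    binding.bind_output("last_hidden_state", "cuda")

    sess.run_with_iobinding(binding)
    # Copy back only once, at the end.
    return binding.copy_outputs_to_cpu()[0]
```

For an autoregressive decoder like T5's, the win is larger still, since past key/value tensors can stay resident on the GPU across decoding steps.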
@sam-writer So currently, to get better inference time for T5 on GPU, is this the code you recommend?

> Here is an example from a branch of the ONNX library that demonstrates using IOBinding as well as other tricks needed to run on GPU: link
Is there any example code of real use? I would like to improve GPU inference time with a T5-base and max_length=1024.
Thanks for sharing the repo, it is really helpful.
I'm exploring ways to do the optimization on GPU. I know it's not presently supported. Could you share some approach or references for implementing the optimization on GPU (NVIDIA)?