Ki6an / fastT5

⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x.
Apache License 2.0
562 stars 72 forks

GPU Optimization #34

Open ashissamal opened 2 years ago

ashissamal commented 2 years ago

Thanks for sharing the repo. It is really helpful.

I'm exploring ways to run the optimization on GPU. I know it's not presently supported. Could you share an approach or some references for implementing the optimization on an NVIDIA GPU?

Ki6an commented 2 years ago

For GPU you can use the onnxruntime-gpu library, but it does not support quantization, so you won't have the advantage of reduced model size during inference.

Here's an example implementation of this library for BERT; you can follow this guide and make suitable changes for T5. In addition, you also need to implement IO binding. I tried without IO binding but wasn't able to get any advantage over PyTorch.

sam-writer commented 2 years ago

I would also check out this recent demo that NVIDIA did of TensorRT, which involves converting to ONNX as an intermediate step. They run the tests on GPU https://github.com/NVIDIA/TensorRT/blob/main/demo/HuggingFace/README.md

pommedeterresautee commented 2 years ago

ONNX Runtime supports GPU quantization through the TensorRT provider (now embedded by default in the GPU version of the PyPI package, so no custom compilation is needed). However, it only supports PTQ, meaning there is a 2-3 point accuracy cost (vs. QAT or dynamic quantization, which are usually close to non-quantized accuracy). Quantization brings a 2x speedup, which you can add to the roughly 1.3x speedup from switching from ORT to TRT, so it's quite significant on base/large models (not yet benchmarked on distilled models).
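For concreteness, a sketch of what enabling the TensorRT provider's int8 (PTQ) path looks like in onnxruntime. The option names come from the ONNX Runtime TensorRT execution provider documentation; the calibration table filename is a placeholder for one produced by a calibration run over representative inputs.

```python
# Provider list for onnxruntime-gpu with the TensorRT EP; TRT has priority,
# and nodes it cannot handle fall through to the CUDA provider.
providers = [
    ("TensorrtExecutionProvider", {
        "trt_int8_enable": True,  # use int8 (PTQ) kernels
        # placeholder name; produced by a PTQ calibration pass
        "trt_int8_calibration_table_name": "calibration.flatbuffers",
    }),
    "CUDAExecutionProvider",
]
```

This list would then be passed as the `providers` argument to `onnxruntime.InferenceSession`.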

Hopefully, QAT is also doable, but it requires some work per model (modifying the attention part to add QDQ nodes). You can see some examples here https://github.com/ELS-RD/transformer-deploy/pull/29 — for now only for ALBERT, ELECTRA, BERT, RoBERTa and DistilBERT. I will probably add support for DeBERTa V1 and V2, T5 and BART as a next step.

nbravulapalli commented 2 years ago

I really appreciate the functionality that the fastT5 library offers!

Like the original poster, I am looking to leverage the speedup from both ONNX Runtime and the quantization that fastT5 offers, and deploy this on an NVIDIA GPU. Do you have any pointers on how to accomplish this with a t5-large model?

@Ki6an or @sam-writer Are there plans to add GPU support for this library?

Thanks!

sam-writer commented 2 years ago

~~@nbravulapalli yes, there are plans to make running on GPU as easy as running on CPU is currently. However, if you need to run on GPU now, your best bet is probably to follow this notebook to convert the model to TensorRT format, which runs on GPU faster than quantized ONNX T5 runs on CPU.~~

I now understand @pommedeterresautee's comment: you do not need to convert to TRT format to use TRT. You can convert to ONNX format, then, per the ONNX Runtime docs, use the TRT execution provider:

```python
import onnxruntime as ort
# set providers to ['TensorrtExecutionProvider', 'CUDAExecutionProvider'] with TensorrtExecutionProvider having the higher priority.
sess = ort.InferenceSession('model.onnx', providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider'])
```

This might be fast enough, because TRT gives a 1.3x speed boost. But if you want the 2x speed boost of quantization, you take an accuracy hit. In fastT5, quantization doesn't hurt accuracy because we use dynamic quantization, which AFAICT isn't an option on GPU yet; you'd be using PTQ instead, which does hurt accuracy.

Another consideration on GPU that isn't a factor on CPU is IO binding: copying values back and forth to the GPU takes time and should be minimized. Not getting IO binding right can cause a performance hit.

sam-writer commented 2 years ago

Here is an example from a branch of the ONNX library that demonstrates using IO binding as well as other tricks needed to run on GPU: link

GenVr commented 2 years ago

@sam-writer So, currently, to optimize T5 inference time on GPU, do you recommend this code?

Here is an example from a branch of the ONNX library that demonstrates using io-binding as well as other tricks needed to run on GPU link

Is there example code showing real-world use?

I would like to improve GPU inference time for a T5-base model with max_length=1024.