abelriboulot / onnxt5

Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.
Apache License 2.0

Use OnnxRuntime IO Binding to improve GPU inference performance #11

Open tianleiwu opened 3 years ago

tianleiwu commented 3 years ago

In the current benchmark results, ONNX is slower than PyTorch above 500 words. I think the cause is the OnnxRuntime API used for inference: https://github.com/abelriboulot/onnxt5/blob/284474952bcb10521a0b0132c677f61981ab2a1c/onnxt5/models.py#L121

For GPU inference, that API needs extra memory copies (CPU to GPU for the input tensors, and GPU back to CPU for the output tensors). When the sequence length is large, this I/O latency can be significant.

I suggest trying OnnxRuntime IO Binding to avoid these extra memory copies.
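For reference, here is a minimal sketch of what IO binding looks like in the onnxruntime Python API. This is not the repo's code; the model file and tensor names below are placeholders, and it assumes a CUDA build of onnxruntime:

```python
import numpy as np
import onnxruntime as ort

# Placeholder model/tensor names; substitute the real encoder/decoder exports.
sess = ort.InferenceSession("t5_encoder.onnx", providers=["CUDAExecutionProvider"])

# Put the input on the GPU once, instead of letting run() copy it on every call.
input_ids = ort.OrtValue.ortvalue_from_numpy(
    np.ones((1, 512), dtype=np.int64), "cuda", 0)

io_binding = sess.io_binding()
io_binding.bind_ortvalue_input("input_ids", input_ids)
# Let ONNX Runtime allocate the output on the GPU; nothing is copied back yet.
io_binding.bind_output("hidden_states", device_type="cuda", device_id=0)

sess.run_with_iobinding(io_binding)

# Copy to host memory only when the result is actually needed on the CPU.
hidden_states = io_binding.copy_outputs_to_cpu()[0]
```

With the inputs and outputs bound on the device, `run_with_iobinding` only launches the kernels; the per-call host/device transfers that `run()` performs are avoided.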

sam-writer commented 3 years ago

The link at the bottom of the issue is dead; I think the appropriate link is now ONNX Runtime IOBinding.

@tianleiwu did you ever successfully do this?

shiqingzhangCSU commented 1 year ago

Hello! Is T5 GPU inference now about the same speed with onnxruntime and pytorch? During decoding, the past key/value tensors produced at each decode step are very large; how can the IO be reduced when running inference with onnxruntime? @tianleiwu

tianleiwu commented 1 year ago

@shiqingzhangCSU, reducing that I/O requires special CUDA kernels (integrated with the BeamSearch operator) to handle the past state. In ONNX Runtime, @wangyems is working on these optimizations for T5, and they are very close to being finished.

You can try out the current optimizations (they are still ongoing): https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/convert_generation.py
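Short of those fused kernels, one way to reduce the copies today is to keep the past key/values on the GPU with IO binding. A rough sketch follows; the model file, input/output names, and shapes are hypothetical, only a single past/present pair is shown (a real T5 decoder has one per layer plus cross-attention state):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical decoder export with inputs "input_ids"/"past_0" and outputs
# "logits"/"present_0".
sess = ort.InferenceSession("t5_decoder_with_past.onnx",
                            providers=["CUDAExecutionProvider"])

def decode_step(input_ids_np, past_0):
    binding = sess.io_binding()
    binding.bind_ortvalue_input(
        "input_ids", ort.OrtValue.ortvalue_from_numpy(input_ids_np, "cuda", 0))
    binding.bind_ortvalue_input("past_0", past_0)      # already on GPU, no copy
    binding.bind_output("logits", device_type="cpu")   # small, copy to host
    binding.bind_output("present_0", device_type="cuda", device_id=0)  # stays on GPU
    sess.run_with_iobinding(binding)
    logits, present_0 = binding.get_outputs()
    return logits.numpy(), present_0                   # present_0 is a GPU OrtValue

# First step: an empty past (past sequence length 0). Shape is hypothetical:
# (batch, num_heads, past_seq_len, head_size).
past = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 8, 0, 64), dtype=np.float32), "cuda", 0)

next_ids = np.array([[0]], dtype=np.int64)             # decoder start token
for _ in range(16):                                    # a few greedy steps
    logits, past = decode_step(next_ids, past)
    next_ids = np.array([[int(np.argmax(logits[:, -1, :]))]], dtype=np.int64)
```

This only removes the host/device copies per step; the BeamSearch operator mentioned above goes further by running the whole generation loop inside a single ONNX graph.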