Open tianleiwu opened 3 years ago
The link at the bottom of the issue is dead; I think the appropriate link is now ONNX Runtime IOBinding.
@tianleiwu did you ever successfully do this?
Hello! For T5 inference on GPU right now, are onnxruntime and pytorch roughly the same speed? During decoding, the past values produced at each step are very large; how can I reduce the IO when running inference with onnxruntime? @tianleiwu
@shiqingzhangCSU, reducing that I/O requires specially designed CUDA kernels (integrated with the BeamSearch operator) to handle the past state. In ONNX Runtime, @wangyems is working on optimizations for T5, and they are very close to being finished.
You can try out the current optimizations (they are still ongoing): https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/convert_generation.py
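A rough command-line sketch for running that script on a T5 model; the flag names below are my reading of the script's argument parser and may differ between onnxruntime versions, so check `--help` first:

```
# Sketch only: verify the exact flags with
#   python -m onnxruntime.transformers.convert_generation --help
python -m onnxruntime.transformers.convert_generation \
    -m t5-small \
    --model_type t5 \
    --output t5_small_beam_search.onnx \
    --use_gpu
```

The exported graph contains a BeamSearch operator that runs the decoder loop inside ONNX Runtime and keeps the past key/value state on the device, which is what should remove the per-step I/O described above.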
In the current benchmark results, ONNX is slower than PyTorch above 500 words. I think the cause is the ONNX Runtime API used for inference: https://github.com/abelriboulot/onnxt5/blob/284474952bcb10521a0b0132c677f61981ab2a1c/onnxt5/models.py#L121
For GPU inference, that API needs extra memory copies (from CPU to GPU for input tensors, and from GPU to CPU for output tensors). When the sequence length is large, the I/O latency can be significant.
I suggest trying ONNX Runtime IO Binding to avoid the extra memory copies.
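A minimal IO Binding sketch with the Python API; the model path and tensor names here are placeholders (not the actual onnxt5 graph names), so substitute the ones from your exported T5 model:

```python
import numpy as np
import onnxruntime as ort

# Placeholder model path; use your exported T5 decoder/encoder graph.
session = ort.InferenceSession("t5_decoder.onnx", providers=["CUDAExecutionProvider"])
binding = session.io_binding()

# Copy the input to the GPU once and bind it, so the run does not pull it from host memory.
# "input_ids" is an assumed input name.
input_ids = np.array([[13959, 1566, 12, 2968, 10, 1]], dtype=np.int64)
input_ids_gpu = ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)
binding.bind_ortvalue_input("input_ids", input_ids_gpu)

# Let ONNX Runtime allocate the output directly on the GPU instead of copying it back to
# the CPU. For T5 decoding, binding the present/past key-value outputs the same way keeps
# them on the device between steps. "logits" is an assumed output name.
binding.bind_output("logits", "cuda", 0)

session.run_with_iobinding(binding)

# Outputs stay on the GPU as OrtValues; call .numpy() only when a host copy is really needed.
logits = binding.get_outputs()[0]
```

The point is that both inputs and outputs are bound to CUDA memory, so the large past tensors never have to cross between host and device on every decode step.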