NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Wrong outputs of converted mt5 models #1613

Open tobigue opened 2 years ago

tobigue commented 2 years ago

Description

I'm trying to convert an mT5 model to TensorRT. I adapted the T5 demo notebook from the main branch; however, the outputs of the TensorRT model are not what they should be. (The original T5 notebook works as expected. The environment is a Docker container built from the main branch of this repository.)

To be able to experiment, I basically copied the demo/HuggingFace/T5 folder to demo/HuggingFace/MT5. The changes I made there are using the MT5 classes from Hugging Face (e.g. MT5Config instead of T5Config) and adding the correct parameters for the mT5 models to MT5ModelTRTConfig.
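Roughly, the class swap I mean looks like this (a minimal sketch, not the exact notebook code; the model name is the public one mentioned below):

```python
# Sketch of the T5 -> MT5 class swap; not the exact demo code.
from transformers import MT5Config, MT5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "T-Systems-onsite/mt5-small-sum-de-en-v2"

config = MT5Config.from_pretrained(MODEL_NAME)
model = MT5ForConditionalGeneration.from_pretrained(MODEL_NAME)
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

# These values differ between t5-small and mt5-small and need to be
# reflected in MT5ModelTRTConfig.
print(config.vocab_size, config.d_model, config.num_layers)
```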

As you can see at the bottom of the adapted notebook, the outputs of the TensorRT model are not as expected, resulting in the generation of nonsense text: mt5-small-sum.ipynb. Further up in the notebook, I also verified that the exported ONNX model produces the same outputs as the PyTorch model, so the problem seems to be with TensorRT.
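The parity check I did is essentially the following (a simplified sketch; the ONNX file name and the input name "input_ids" are placeholders, and the real notebook exports encoder and decoder separately):

```python
# Simplified parity check between the PyTorch encoder and its ONNX export.
# "mt5-small-encoder.onnx" and the input name "input_ids" are placeholders.
import numpy as np
import onnxruntime as ort
import torch
from transformers import MT5ForConditionalGeneration, T5Tokenizer

name = "T-Systems-onsite/mt5-small-sum-de-en-v2"
model = MT5ForConditionalGeneration.from_pretrained(name).eval()
tokenizer = T5Tokenizer.from_pretrained(name)

inputs = tokenizer("Ein kurzer Beispieltext.", return_tensors="pt")

with torch.no_grad():
    torch_out = model.encoder(input_ids=inputs.input_ids).last_hidden_state.numpy()

session = ort.InferenceSession("mt5-small-encoder.onnx")
onnx_out = session.run(None, {"input_ids": inputs.input_ids.numpy()})[0]

# Should agree within float tolerance if the export is correct.
print(np.max(np.abs(torch_out - onnx_out)))
```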

As the model code of mT5 is basically the same as for T5, I don't understand why the conversion does not seem to work correctly for the mT5 models. (Besides the public T-Systems-onsite/mt5-small-sum-de-en-v2 model used in the committed notebook, I also tried a private fine-tuned mt5-small model, which didn't work either.)

I would be very thankful for any ideas on why this could be happening, or any tips on how to debug this further. Cheers!

cc @vinhngx @parthchadha @rajeevsrao

Environment

TensorRT Version: 8.2.0.6
NVIDIA GPU: RTX 3090
NVIDIA Driver Version: 470.57.02
CUDA Version: 11.4, V11.4.120
CUDNN Version: 8.2.4
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8.10
Tensorflow Version (if applicable): 2.5.1
PyTorch Version (if applicable): 1.9.1+cu111 (cuDNN 8005)
Baremetal or Container (if so, version): Docker container built from the main branch of this repository

Relevant Files

Steps To Reproduce

tobigue commented 2 years ago

It seems that I have found the issue.

As it turns out, the vocabulary size of T5 is hardcoded in trt.py. When this is set to the correct vocabulary size of the mT5 model, the outputs are as expected.
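The fix is essentially to take the vocabulary size from the model's Hugging Face config instead of the hardcoded T5 value (a sketch; how the value is threaded into the demo's config classes depends on the local changes):

```python
# Read the vocabulary size from the model config instead of hardcoding it.
from transformers import MT5Config

hf_config = MT5Config.from_pretrained("T-Systems-onsite/mt5-small-sum-de-en-v2")
vocab_size = hf_config.vocab_size  # 250112 for mt5-small vs. 32128 for t5-small

# ... use vocab_size wherever trt.py previously used the hardcoded T5 value,
# e.g. when defining the decoder output shapes.
```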

vinhngx commented 2 years ago

Thanks @tobigue for reporting this and the solution. Please keep us posted on how you go with this model.

tobigue commented 2 years ago

Hey @vinhngx, thank you for answering.

When testing inference with the converted mT5-small model on different samples, I encountered some rather strange behaviour. While shorter samples are up to 4x faster with TensorRT, longer samples are up to 2.5x slower compared to the PyTorch model.

This was on an RTX 3090 with a custom fine-tuned mT5-small model (translation task, so output length should be roughly equal to input length).

(benchmark plots: inference time over input length for the TensorRT and PyTorch models)

My first thought was that the TensorRT model gets slower because it does not use a past_key_values cache for the decoder. However, the demo code uses the greedy_search function for generation, and I could not verify that the PyTorch model actually uses the cache. Also, setting flags like use_cache=False for the PyTorch model did not have any effect on its performance. So it is unclear to me why the PyTorch model seems to scale linearly with input length while the TensorRT model seems not to.
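What I mean by testing use_cache is a timing comparison along these lines (a rough sketch with an illustrative public model and settings, not my exact benchmark):

```python
# Compare greedy generation time with and without the past_key_values cache.
import time
import torch
from transformers import MT5ForConditionalGeneration, T5Tokenizer

name = "google/mt5-small"  # illustrative; the actual test used a private fine-tuned model
model = MT5ForConditionalGeneration.from_pretrained(name).eval().cuda()
tokenizer = T5Tokenizer.from_pretrained(name)

inputs = tokenizer("Ein langer Eingabetext. " * 50, return_tensors="pt").to("cuda")

for use_cache in (True, False):
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        model.generate(**inputs, max_length=256, num_beams=1, use_cache=use_cache)
    torch.cuda.synchronize()
    print(f"use_cache={use_cache}: {time.time() - start:.2f}s")
```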

Another idea I had was that it might have to do with the optimization profiles generated for the TensorRT engine. The "optimal" shape is set to half the maximum sequence length, so I tried setting opt=max, but the resulting engine did not behave differently.
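What I tried corresponds roughly to the following TensorRT Python API usage (a sketch; the tensor name, shapes and ONNX file are placeholders, and the demo wraps this in its own helpers):

```python
# Build an engine with an optimization profile where opt == max sequence length.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("mt5-small-encoder.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
max_len = 512
# opt shape set equal to max, instead of the demo's default of max // 2
profile.set_shape("input_ids", (1, 1), (1, max_len), (1, max_len))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
```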

I also tried to reproduce the behaviour with the converted t5-small model from the original demo via the t5-playground.ipynb notebook, by feeding the model longer samples to translate. However, for long inputs the model never translated the whole input, but always stopped generation early, so I couldn't get meaningful measurements for long samples.

If you or anyone else has ideas on why we see this behaviour for long samples with the TensorRT version, I'd be very happy to hear them.

Cheers!

cc @parthchadha @rajeevsrao

vinhngx commented 2 years ago

Thanks @tobigue for the comprehensive analysis. Based on our internal experimentation with T5-small at very large sequence lengths, TRT is still faster than the framework baseline, though the speedup drops from 2.5x to ~1.1x. We are working on resolving this in future TensorRT releases.