NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Nougat] Accuracy problem: different output from both the float32 and bfloat16 TRT-LLM engines compared with the float32 Hugging Face original model #2207

Open ehuaa opened 1 month ago

ehuaa commented 1 month ago

System Info

Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
Python version: 3.10.12
PyTorch version (GPU?): 2.4.0+cu121 (True)
TensorRT-LLM version: 0.12.0
Driver Version: 535.161.08
CUDA Version: 12.5
GPU: A40 (single card)

Who can help?

@byshiue

Reproduction

  1. Follow the TensorRT-LLM Linux installation tutorial to set up Docker and install TensorRT-LLM 0.12.0: https://nvidia.github.io/TensorRT-LLM/installation/linux.html

  2. Download my own fine-tuned version of Nougat, which has the same architecture as nougat-base 0.1.0 and only changes the model weights, then copy the model to examples/multimodal/tmp/hf_models/${MODEL_NAME} to align with the official example script:

    # Clone the fine-tuned version of the nougat model
    git lfs install
    git clone https://huggingface.co/shenzhanyou/table_nougat

  3. Follow the Nougat tutorial and convert the original model above to bfloat16 and float32 engines. (Only the bfloat16 commands are shown; replace bfloat16 with float32 to check float32 accuracy.)

    python ../enc_dec/convert_checkpoint.py --model_type bart \
        --model_dir tmp/hf_models/${MODEL_NAME} \
        --output_dir tmp/trt_models/${MODEL_NAME}/bfloat16 \
        --tp_size 1 \
        --pp_size 1 \
        --dtype bfloat16 \
        --nougat
    
    trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/decoder \
        --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16/decoder \
        --paged_kv_cache disable \
        --moe_plugin disable \
        --enable_xqa disable \
        --gemm_plugin bfloat16 \
        --bert_attention_plugin bfloat16 \
        --gpt_attention_plugin bfloat16 \
        --remove_input_padding enable \
        --max_beam_width 1 \
        --max_batch_size 1 \
        --max_seq_len 101 \
        --max_input_len 1 \
        --max_encoder_input_len 588 # 1 (max_batch_size) * 588 (num_visual_features)

    python build_visual_engine.py --model_type nougat --model_path tmp/hf_models/${MODEL_NAME}

  4. Replace only the test image in examples/multimodal/run.py with my own image below, then check the result:

    python run.py \
        --hf_model_dir tmp/hf_models/${MODEL_NAME} \
        --visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
        --llm_engine_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16

[Test image attached to the issue]

Expected behavior

The output of the original version of NougatModel is:

\begin{tabular}{@{}llcccccccc@{}}
\hline\hline
\textsf{Dataset} & \textsf{Reference} & \textsf{Resolution} & \textsf{SS} & \textsf{NC} & \textsf{AD} & \textsf{TRT} & \textsf{ML} & \textsf{MP} & \textsf{DM} \\
 \hline
\textsf{\textsf{\textsf{DVS}-Gesture}} & \textsf{[56]} & \textsf{128\,$\times$\,128} & \textsf{1342} & \textsf{11} & \textsf{5s} & \textsf{1.86h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{SL-Animal}-DVS}} & \textsf{[57]} & \textsf{128\,$\times$\,128} & \textsf{1121} & \textsf{19} & \textsf{4.5s} & \textsf{1.4h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{Action}Recognition}} & \textsf{[58]} & \textsf{260\,$\times$\,346} & \textsf{291} & \textsf{10} & \textsf{5s} & \textsf{0.4h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{DailyAction}}} & \textsf{[37]} & \textsf{260\,$\times$\,346} & \textsf{1440} & \textsf{12} & \textsf{5s} & \textsf{2h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\hline
\textsf{\textsf{\textsf{\textsf{DVS}-SLR}} (Ours)} & \textsf{-} & \textsf{260\,$\times$\,346} & \textsf{5418} & \textsf{21} & \textsf{6s} & \textsf{9.03h} & \textsf{$\surd$} & \textsf{$\surd$} & \textsf{$\surd$} \\
\hline
\end{tabular}

Actual behavior

The bfloat16 and float32 TRT-LLM engines give the same output, shown below.

\begin{tabular}{@{}lllllllllll@{}}
\hline\hline
\textsf{Dataset} & \textsf{Reference} & \textsf{Resolution} & \textsf{SS} & \textsf{NC} & \textsf{AD} & \textsf{TRT} & \textsf{ML} & \textsf{MP} & \textsf{DM} \\
 \hline
\textsf{\textsf{\textsf{DVS-Gesture}}} & \textsf{[56]} & \textsf{128\,$\times$\,128} & \textsf{1342} & \textsf{11} & \textsf{5s} & \textsf{1.86h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{SL-Animal-DVS}}} & \textsf{[57]} & \textsf{128\,$\times$\,128} & \textsf{1121} & \textsf{19} & \textsf{4.5s} & \textsf{1.4h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{ActionRecognition}}} & \textsf{[58]} & \textsf{260\,$\times$\,346} & \textsf{291} & \textsf{10} & \textsf{5s} & \textsf{0.4h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{DailyAction}}} & \textsf{[37]} & \textsf{260\,$\times$\,346} & \textsf{1440} & \textsf{12} & \textsf{5s} & \textsf{2h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\hline
\textsf{\textsf{\textsf{DVS-SLR}}} (\textsf{Ours}) & \textsf{-} & \textsf{260\,$\times$\,346} & \textsf{5418} & \textsf{21} & \textsf{6s} & \textsf{9.03h} & \textsf{$\surd$} & \textsf{$\surd$} & \textsf{$\surd$} \\
\hline
\end{tabular}

This differs from the original transformers result in the first line: \begin{tabular}{@{}llcccccccc@{}} from the original transformers model vs. \begin{tabular}{@{}lllllllllll@{}} from the TRT-LLM engines (both bfloat16 and float32).
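To pinpoint every place where the two generated tables diverge (not just the first line), a small diff helper can be used. This is only an illustrative sketch; hf_output.tex and trt_output.tex are placeholder file names for the two generated strings:

from difflib import unified_diff

# Placeholder paths: save the transformers output and the TRT-LLM engine
# output to these files (or substitute the strings directly).
with open("hf_output.tex") as f:
    hf_lines = f.read().splitlines()
with open("trt_output.tex") as f:
    trt_lines = f.read().splitlines()

# Print a line-level diff between the two generated LaTeX tables.
for line in unified_diff(hf_lines, trt_lines, fromfile="transformers", tofile="trtllm", lineterm=""):
    print(line)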

Code for the transformers run:

from PIL import Image

import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("shenzhanyou/table_nougat")
tokenizer = processor.tokenizer
model = VisionEncoderDecoderModel.from_pretrained("shenzhanyou/table_nougat")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open(filepath)  # path to the test image above
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# verify generation (greedy decoding)
outputs = model.generate(
    pixel_values,
    min_length=1,
    max_length=4096,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[
        [tokenizer.unk_token_id],
    ],
    return_dict_in_generate=True,
    do_sample=False,
)
generated = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
print(generated)

Additional notes

Not all images show different results between transformers and TensorRT-LLM v0.12.0; this particular image is an odd one.
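As a minimal sketch for checking whether this image is already numerically sensitive on the Hugging Face side, one could compare greedy outputs of the original model in float32 and bfloat16. The greedy_decode helper and the image path below are placeholders, not part of the original report:

import torch
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("shenzhanyou/table_nougat")
tokenizer = processor.tokenizer

def greedy_decode(dtype, image_path, device="cuda"):
    # Load the model in the requested dtype and run greedy decoding.
    model = VisionEncoderDecoderModel.from_pretrained(
        "shenzhanyou/table_nougat", torch_dtype=dtype
    ).to(device).eval()
    pixel_values = processor(Image.open(image_path), return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device=device, dtype=dtype)
    with torch.no_grad():
        out = model.generate(
            pixel_values,
            max_length=4096,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0]

fp32 = greedy_decode(torch.float32, "test_image.png")   # placeholder path
bf16 = greedy_decode(torch.bfloat16, "test_image.png")  # placeholder path
print("fp32 and bf16 HF outputs match:", fp32 == bf16)

If the two Hugging Face runs already disagree on this image, the divergence is plausibly a precision-sensitivity issue of the model itself rather than an engine bug.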

github-actions[bot] commented 1 day ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.