NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Nougat] Accuracy problem: different output from both the float32 and bfloat16 TRT-LLM engines compared with the float32 Hugging Face original model #2207

Open ehuaa opened 1 month ago

ehuaa commented 1 month ago

System Info

Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
Python version: 3.10.12
PyTorch version (GPU?): 2.4.0+cu121 (True)
TensorRT-LLM version: 0.12.0
Driver Version: 535.161.08
CUDA Version: 12.5
GPU: A40 (single card)

Who can help?

@byshiue

Reproduction

  1. Follow the TensorRT-LLM Linux installation tutorial to set up Docker and install TensorRT-LLM 0.12.0: https://nvidia.github.io/TensorRT-LLM/installation/linux.html

  2. Download my own fine-tuned version of Nougat, which has the same architecture as nougat-base 0.1.0 and only changes the model weights, then copy the model to examples/multimodal/tmp/hf_models/${MODEL_NAME} to align with the official example script:

    # Clone the fine-tuned version of the nougat model
    git lfs install
    git clone https://huggingface.co/shenzhanyou/table_nougat

  3. Follow the Nougat tutorial and convert the original model above to bfloat16 and float32 engines. (Only the bfloat16 commands are shown; replace bfloat16 with float32 to check float32 accuracy.)

    python ../enc_dec/convert_checkpoint.py --model_type bart \
        --model_dir tmp/hf_models/${MODEL_NAME} \
        --output_dir tmp/trt_models/${MODEL_NAME}/bfloat16 \
        --tp_size 1 \
        --pp_size 1 \
        --dtype bfloat16 \
        --nougat
    
    trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/bfloat16/decoder \
        --output_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16/decoder \
        --paged_kv_cache disable \
        --moe_plugin disable \
        --enable_xqa disable \
        --gemm_plugin bfloat16 \
        --bert_attention_plugin bfloat16 \
        --gpt_attention_plugin bfloat16 \
        --remove_input_padding enable \
        --max_beam_width 1 \
        --max_batch_size 1 \
        --max_seq_len 101 \
        --max_input_len 1 \
        --max_encoder_input_len 588 # 1 (max_batch_size) * 588 (num_visual_features)

    python build_visual_engine.py --model_type nougat --model_path tmp/hf_models/${MODEL_NAME}

  4. Replace only the test image in examples/multimodal/run.py with my own image below, then check the result:

    python run.py \
        --hf_model_dir tmp/hf_models/${MODEL_NAME} \
        --visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
        --llm_engine_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16

[Test image attached to the issue]

Expected behavior

The output of the original version of NougatModel is:

\begin{tabular}{@{}llcccccccc@{}}
\hline\hline
\textsf{Dataset} & \textsf{Reference} & \textsf{Resolution} & \textsf{SS} & \textsf{NC} & \textsf{AD} & \textsf{TRT} & \textsf{ML} & \textsf{MP} & \textsf{DM} \\
 \hline
\textsf{\textsf{\textsf{DVS}-Gesture}} & \textsf{[56]} & \textsf{128\,$\times$\,128} & \textsf{1342} & \textsf{11} & \textsf{5s} & \textsf{1.86h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{SL-Animal}-DVS}} & \textsf{[57]} & \textsf{128\,$\times$\,128} & \textsf{1121} & \textsf{19} & \textsf{4.5s} & \textsf{1.4h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{Action}Recognition}} & \textsf{[58]} & \textsf{260\,$\times$\,346} & \textsf{291} & \textsf{10} & \textsf{5s} & \textsf{0.4h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{DailyAction}}} & \textsf{[37]} & \textsf{260\,$\times$\,346} & \textsf{1440} & \textsf{12} & \textsf{5s} & \textsf{2h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\hline
\textsf{\textsf{\textsf{\textsf{DVS}-SLR}} (Ours)} & \textsf{-} & \textsf{260\,$\times$\,346} & \textsf{5418} & \textsf{21} & \textsf{6s} & \textsf{9.03h} & \textsf{$\surd$} & \textsf{$\surd$} & \textsf{$\surd$} \\
\hline
\end{tabular}

Actual behavior

The bfloat16 and float32 TRT-LLM engines give the same output, shown below.

\begin{tabular}{@{}lllllllllll@{}}
\hline\hline
\textsf{Dataset} & \textsf{Reference} & \textsf{Resolution} & \textsf{SS} & \textsf{NC} & \textsf{AD} & \textsf{TRT} & \textsf{ML} & \textsf{MP} & \textsf{DM} \\
 \hline
\textsf{\textsf{\textsf{DVS-Gesture}}} & \textsf{[56]} & \textsf{128\,$\times$\,128} & \textsf{1342} & \textsf{11} & \textsf{5s} & \textsf{1.86h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{SL-Animal-DVS}}} & \textsf{[57]} & \textsf{128\,$\times$\,128} & \textsf{1121} & \textsf{19} & \textsf{4.5s} & \textsf{1.4h} & \textsf{$\times$} & \textsf{$\times$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{ActionRecognition}}} & \textsf{[58]} & \textsf{260\,$\times$\,346} & \textsf{291} & \textsf{10} & \textsf{5s} & \textsf{0.4h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\textsf{\textsf{\textsf{DailyAction}}} & \textsf{[37]} & \textsf{260\,$\times$\,346} & \textsf{1440} & \textsf{12} & \textsf{5s} & \textsf{2h} & \textsf{$\times$} & \textsf{$\surd$} & \textsf{$\times$} \\
\hline
\textsf{\textsf{\textsf{DVS-SLR}}} (\textsf{Ours}) & \textsf{-} & \textsf{260\,$\times$\,346} & \textsf{5418} & \textsf{21} & \textsf{6s} & \textsf{9.03h} & \textsf{$\surd$} & \textsf{$\surd$} & \textsf{$\surd$} \\
\hline
\end{tabular}

This differs from the original transformers result in the first line: \begin{tabular}{@{}llcccccccc@{}} from the original transformers model vs. \begin{tabular}{@{}lllllllllll@{}} from the TRT-LLM engines (both bfloat16 and float32).
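To pinpoint every place where the two generated tables diverge (not just the first line), a small diff helper can be used. This is only an illustrative sketch; hf_output.tex and trt_output.tex are placeholder file names for the two generated strings:

from difflib import unified_diff

# Placeholder paths: save the transformers output and the TRT-LLM engine
# output to these files (or substitute the strings directly).
with open("hf_output.tex") as f:
    hf_lines = f.read().splitlines()
with open("trt_output.tex") as f:
    trt_lines = f.read().splitlines()

# Print a line-level diff between the two generated LaTeX tables.
for line in unified_diff(hf_lines, trt_lines, fromfile="transformers", tofile="trtllm", lineterm=""):
    print(line)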

Code for the transformers run:

from PIL import Image

import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("shenzhanyou/table_nougat")
tokenizer = processor.tokenizer
model = VisionEncoderDecoderModel.from_pretrained("shenzhanyou/table_nougat")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.open(filepath)  # path to the test image above
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# verify generation (greedy decoding)
outputs = model.generate(
    pixel_values,
    min_length=1,
    max_length=4096,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[
        [tokenizer.unk_token_id],
    ],
    return_dict_in_generate=True,
    do_sample=False,
)
generated = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
print(generated)

Additional notes

Not all images show different results between transformers and TensorRT-LLM v0.12.0; this particular image is an odd one.
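As a minimal sketch for checking whether this image is already numerically sensitive on the Hugging Face side, one could compare greedy outputs of the original model in float32 and bfloat16. The greedy_decode helper and the image path below are placeholders, not part of the original report:

import torch
from PIL import Image
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("shenzhanyou/table_nougat")
tokenizer = processor.tokenizer

def greedy_decode(dtype, image_path, device="cuda"):
    # Load the model in the requested dtype and run greedy decoding.
    model = VisionEncoderDecoderModel.from_pretrained(
        "shenzhanyou/table_nougat", torch_dtype=dtype
    ).to(device).eval()
    pixel_values = processor(Image.open(image_path), return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device=device, dtype=dtype)
    with torch.no_grad():
        out = model.generate(
            pixel_values,
            max_length=4096,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0]

fp32 = greedy_decode(torch.float32, "test_image.png")   # placeholder path
bf16 = greedy_decode(torch.bfloat16, "test_image.png")  # placeholder path
print("fp32 and bf16 HF outputs match:", fp32 == bf16)

If the two Hugging Face runs already disagree on this image, the divergence is plausibly a precision-sensitivity issue of the model itself rather than an engine bug.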

github-actions[bot] commented 1 day ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.