System Info
Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.35
Python version: 3.10.12
PyTorch version (GPU?): 2.4.0+cu121 (True)
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
Driver Version: 535.161.08
CUDA Version: 12.5
GPU: A40 (single card)
Who can help?
@byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Follow the TensorRT-LLM Linux installation tutorial to install Docker and TensorRT-LLM 0.12.0: https://nvidia.github.io/TensorRT-LLM/installation/linux.html
Download my own fine-tuned version of Nougat, which has the same architecture as nougat-base 0.1.0; only the model weights differ.

# Clone the fine-tuned version of the nougat model
git lfs install
git clone https://huggingface.co/shenzhanyou/table_nougat

Then copy the model to examples/multimodal/tmp/hf_models/${MODEL_NAME} to align with the official example script.
Follow the Nougat tutorial and convert the model above to bfloat16 and float32 versions. (Only the bfloat16 command is shown; replace bfloat16 with float32 to check float32 accuracy.)
python build_visual_engine.py --model_type nougat --model_path tmp/hf_models/${MODEL_NAME}
Only replace the test image in examples/multimodal/run.py with my own image below to check the result.

python run.py \
  --hf_model_dir tmp/hf_models/${MODEL_NAME} \
  --visual_engine_dir tmp/trt_engines/${MODEL_NAME}/vision_encoder \
  --llm_engine_dir tmp/trt_engines/${MODEL_NAME}/1-gpu/bfloat16
Expected behavior
The output of the original (transformers) version of NougatModel is:
actual behavior
The bfloat16 and float32 TRT engines give the same output, shown below.
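Building both dtypes is a reasonable way to rule out rounding as the cause: if plain precision loss were responsible, the float32 engine would be expected to match transformers. As a standalone illustration of how much resolution bfloat16 gives up (not part of the reproduction; `to_bfloat16` is a hypothetical helper that approximates the conversion by truncating a float32 bit pattern, where real conversions round to nearest even):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Approximate a bfloat16 round-trip: keep only the top 16 bits of the
    float32 pattern (sign, 8 exponent bits, 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# bfloat16 keeps only ~2-3 significant decimal digits:
print(to_bfloat16(1.0))    # 1.0 is exactly representable
print(to_bfloat16(1.001))  # collapses to 1.0
```

Since both engines agree with each other but disagree with transformers, the divergence here is unlikely to be a pure low-precision artifact.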
This differs from the transformers inference result starting at the very first line:

\begin{tabular}{@{}llcccccccc@{}} (original transformers)

versus

\begin{tabular}{@{}lllllllllll@{}} (TRT-LLM engine, bfloat16 and float32)
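To pin down programmatically where two generated outputs diverge, a small comparison helper can be used (a minimal sketch; the two strings below are just the first lines quoted above, standing in for the full multi-line outputs):

```python
def first_divergence(ref: str, hyp: str):
    """Return (line_index, ref_line, hyp_line) for the first line where the
    two outputs differ, or None if they are identical line for line."""
    ref_lines, hyp_lines = ref.splitlines(), hyp.splitlines()
    for i, (r, h) in enumerate(zip(ref_lines, hyp_lines)):
        if r != h:
            return i, r, h
    if len(ref_lines) != len(hyp_lines):  # one output is a prefix of the other
        i = min(len(ref_lines), len(hyp_lines))
        return i, "\n".join(ref_lines[i:]), "\n".join(hyp_lines[i:])
    return None

hf_output = r"\begin{tabular}{@{}llcccccccc@{}}"    # original transformers
trt_output = r"\begin{tabular}{@{}lllllllllll@{}}"  # TRT-LLM engine
print(first_divergence(hf_output, trt_output))  # diverges at line 0
```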
Code for the transformers baseline:
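The snippet itself is not shown above; a minimal transformers-side baseline for Nougat-style models, assuming the standard NougatProcessor / VisionEncoderDecoderModel API and a hypothetical local image path `test_table.png`, would look roughly like:

```python
def run_nougat(model_dir: str, image_path: str, max_new_tokens: int = 1024) -> str:
    """Run a Nougat-style encoder-decoder model from `model_dir` on one image
    and return the post-processed markdown text. Heavy third-party imports are
    deferred so the sketch can be loaded without transformers installed."""
    from PIL import Image
    from transformers import NougatProcessor, VisionEncoderDecoderModel

    processor = NougatProcessor.from_pretrained(model_dir)
    model = VisionEncoderDecoderModel.from_pretrained(model_dir)

    pixel_values = processor(Image.open(image_path), return_tensors="pt").pixel_values
    outputs = model.generate(
        pixel_values,
        max_new_tokens=max_new_tokens,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    )
    text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return processor.post_process_generation(text, fix_markdown=False)

if __name__ == "__main__":
    # Hypothetical paths; substitute the actual model dir and test image.
    print(run_nougat("tmp/hf_models/table_nougat", "test_table.png"))
```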
additional notes
Not all images show different results between transformers and TRT-LLM v0.12.0; this particular image is an odd one.