Oldpan opened this issue 4 days ago
Hi @Oldpan, can you try the latest main branch? We updated it just today to enable the C++ runtime for the LLM part of all multimodal models. You can pass --use_py_session
to test the Python runtime; by default (without the flag) the C++ runtime is tested, as in https://github.com/NVIDIA/TensorRT-LLM/blob/1730a587d806be2397ee75722ea2b35dd8631c70/examples/multimodal/run.py#L75
It would be better to verify that the outputs match using this run.py script before moving to Triton. Can you give it a try and report back?
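To make that comparison concrete, here is a minimal, hypothetical helper for checking the two runtimes against each other: capture the generated token IDs from each run (e.g. by printing the output IDs in run.py with and without --use_py_session) and locate the first position where they diverge. The function and sample token lists below are illustrative, not part of the TensorRT-LLM API.

```python
# Hypothetical helper: compare token IDs produced by the Python runtime
# and the C++ runtime, and report where they first diverge.

def first_divergence(py_tokens, cpp_tokens):
    """Return the index of the first differing token, or -1 if identical."""
    for i, (a, b) in enumerate(zip(py_tokens, cpp_tokens)):
        if a != b:
            return i
    # Same prefix but different lengths also counts as a divergence.
    if len(py_tokens) != len(cpp_tokens):
        return min(len(py_tokens), len(cpp_tokens))
    return -1

# Illustrative outputs: the two runs agree on the first three tokens
# and diverge at index 3.
py_out = [151644, 872, 198, 9906, 13]
cpp_out = [151644, 872, 198, 9907, 13]
print(first_divergence(py_out, cpp_out))  # → 3
```

If the divergence index is 0, the inputs (input_ids or prompt table) likely differ; a divergence deep into the sequence points more toward numerical drift between the runtimes.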
System Info
When using TRT-LLM to run a multimodal model, I found that the results are inconsistent between the Python runtime and the Python-binding C++ runtime (the Python runtime results are correct, while the Python-binding C++ results occasionally contain errors).
For the Python runtime I used TensorRT-LLM/examples/multimodal/run.py; for the Python bindings I tested a service set up with the Triton Inference Server, and also something similar to TensorRT-LLM/examples/bindings/executor/example_basic.py. The model is InternVL2 (vision encoder + Qwen2).
I strictly aligned the prompt-tuning inputs and input_ids, but there are still discrepancies. How should I investigate or resolve this problem?
Thanks~
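Since the prompt-tuning table for InternVL2 is the vision encoder's output, one way to investigate is to dump that tensor from both pipelines and compare them numerically; a preprocessing mismatch shows up as a large difference, while tiny floating-point drift is expected. This is a sketch with synthetic stand-in data — the tolerance, shapes, and hidden size are assumptions, and you would load the tensors you dumped yourself.

```python
import numpy as np

def compare_prompt_tables(a: np.ndarray, b: np.ndarray, atol: float = 1e-3) -> str:
    """Report the max absolute difference between two prompt-tuning tables.

    Small FP drift is expected; a large difference suggests the two
    pipelines are not feeding the model the same vision embeddings.
    """
    if a.shape != b.shape:
        return f"shape mismatch: {a.shape} vs {b.shape}"
    diff = float(np.abs(a - b).max())
    return f"max abs diff {diff:.3e} ({'OK' if diff <= atol else 'SUSPICIOUS'})"

# Synthetic example standing in for the dumped tensors
# (shape is illustrative, not taken from the actual model config).
rng = np.random.default_rng(0)
table = rng.standard_normal((256, 3584), dtype=np.float32)
print(compare_prompt_tables(table, table + 1e-5))
```

Checking the tokenized input_ids the same way (exact equality, since they are integers) quickly narrows the problem down to either the inputs or the runtime itself.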
Who can help?
@byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
I'm not able to provide a reproduction right now; I'm working on it.
Expected behavior
The C++ results should be the same as the Python runtime results
Actual behavior
The C++ results are sometimes inaccurate
Additional notes
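Because the errors are described as occasional, it is worth first ruling out sampling nondeterminism: with greedy decoding, repeated runs on the same input should be bit-identical within a single runtime. The sketch below is generic — `generate` is a stand-in for whichever pipeline (Python runtime, C++ executor, or Triton endpoint) is being tested, not a TensorRT-LLM API.

```python
# Sketch: check that a generation pipeline is deterministic before
# comparing runtimes. `generate` is any callable mapping input token
# IDs to output token IDs.

def is_deterministic(generate, input_ids, runs=5):
    """Run the pipeline several times and check all outputs match."""
    outputs = [tuple(generate(input_ids)) for _ in range(runs)]
    return all(o == outputs[0] for o in outputs)

# Trivial deterministic stand-in for illustration:
print(is_deterministic(lambda ids: [t + 1 for t in ids], [1, 2, 3]))  # → True
```

If both runtimes are individually deterministic but disagree with each other, the mismatch is in the inputs or in runtime-level numerics rather than in sampling.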