NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Inconsistent Results Between Python Runtime and Python-Binding-C++ When Running TRT-LLM Multimodal #2362

Open · Oldpan opened this issue 4 days ago

Oldpan commented 4 days ago

System Info

When using TRT-LLM to run a multimodal model, I found that the results are inconsistent between the Python runtime and the Python bindings for the C++ runtime (the Python runtime results are correct, while the C++-bindings results occasionally contain errors).

For the Python runtime, I used TensorRT-LLM/examples/multimodal/run.py; for the Python bindings, I tested with a service set up through the Triton Inference Server, and I also tested something similar to TensorRT-LLM/examples/bindings/executor/example_basic.py. The model is InternVL2 (vision encoder + Qwen2).

I strictly aligned the prompt-tuning table and the input_ids, but the issue persists. How should I investigate or resolve this problem?
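For reference, a minimal, hedged sketch of the executor-bindings path (adapted from examples/bindings/executor/example_basic.py). The engine path, the prompt-table file, the placeholder ids, and some parameter names (e.g. max_tokens vs. max_new_tokens) are illustrative and may differ between TensorRT-LLM versions:

```python
# Hedged sketch, not a drop-in test: drive the C++ executor with the same prompt table
# and token ids as the Python runtime. Paths, placeholder ids, and max_tokens are
# illustrative; some keyword names vary across TensorRT-LLM versions.
import torch
import tensorrt_llm.bindings.executor as trtllm

llm_engine_dir = "/path/to/llm_engine"        # hypothetical path
prompt_table = torch.load("prompt_table.pt")  # [num_vtokens, hidden_size], same dtype as the engine
input_ids = [1, 2, 3, 4]                      # placeholder: use the exact ids fed to run.py,
                                              # including the virtual tokens >= vocab_size

executor = trtllm.Executor(llm_engine_dir, trtllm.ModelType.DECODER_ONLY,
                           trtllm.ExecutorConfig(1))

request = trtllm.Request(
    input_token_ids=input_ids,
    max_tokens=64,  # greedy defaults keep the two runtimes comparable
    prompt_tuning_config=trtllm.PromptTuningConfig(prompt_table))

request_id = executor.enqueue_request(request)
responses = executor.await_responses(request_id)
print(responses[0].result.output_token_ids)
```

The key point is that the prompt table and the expanded input ids fed here are exactly the same tensors that the Python runtime receives.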

Thanks~

Who can help?

@byshiue

Information

Tasks

Reproduction

I'm not able to provide it right now, I'm working on it

Expected behavior

The C++ results should be the same as the Python runtime results

actual behavior

The C++ results are sometimes inaccurate

additional notes

symphonylyh commented 3 days ago

Hi @Oldpan, can you try the latest main branch? We updated it just today to turn on the C++ runtime for the LLM part of all multimodal models. You can use --use_py_session to test the Python runtime; by default, without the flag, it tests the C++ runtime, as in https://github.com/NVIDIA/TensorRT-LLM/blob/1730a587d806be2397ee75722ea2b35dd8631c70/examples/multimodal/run.py#L75

It would be better if you could verify that the outputs match with this run.py script before moving to Triton. Could you give it a try and report back?
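For context, a minimal sketch of what the flag typically toggles in the example scripts; the helper name and the from_dir arguments here are illustrative, not the exact code in run.py:

```python
# Sketch of the runtime selection behind --use_py_session: the flag picks the pure-Python
# ModelRunner, while the default path uses ModelRunnerCpp, which wraps the C++ runtime
# through the Python bindings. Helper name and arguments are illustrative.
from tensorrt_llm.runtime import ModelRunner, ModelRunnerCpp

def build_llm_runner(llm_engine_dir: str, use_py_session: bool):
    runner_cls = ModelRunner if use_py_session else ModelRunnerCpp
    return runner_cls.from_dir(engine_dir=llm_engine_dir, rank=0)
```

A quick A/B test is then to run the same examples/multimodal/run.py command twice, once with --use_py_session and once without, keeping the engines and input text identical.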