Oldpan opened this issue 4 days ago
Hi @Oldpan, can you try the latest main branch? We updated it just today to enable the C++ runtime for the LLM part of all multimodal models. You can pass --use_py_session
to test the Python runtime; by default (without the flag) the C++ runtime is tested, as in https://github.com/NVIDIA/TensorRT-LLM/blob/1730a587d806be2397ee75722ea2b35dd8631c70/examples/multimodal/run.py#L75
It would be better to verify that the outputs match using this run.py script before moving to Triton. Can you give it a try and report back?
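To make that comparison concrete, here is a minimal, hypothetical helper for checking the two runtimes against each other: capture the generated token IDs from each run (e.g. by printing the output IDs in run.py with and without --use_py_session) and locate the first position where they diverge. The function and sample token lists below are illustrative, not part of the TensorRT-LLM API.

```python
# Hypothetical helper: compare token IDs produced by the Python runtime
# and the C++ runtime, and report where they first diverge.

def first_divergence(py_tokens, cpp_tokens):
    """Return the index of the first differing token, or -1 if identical."""
    for i, (a, b) in enumerate(zip(py_tokens, cpp_tokens)):
        if a != b:
            return i
    # Same prefix but different lengths also counts as a divergence.
    if len(py_tokens) != len(cpp_tokens):
        return min(len(py_tokens), len(cpp_tokens))
    return -1

# Illustrative outputs: the two runs agree on the first three tokens
# and diverge at index 3.
py_out = [151644, 872, 198, 9906, 13]
cpp_out = [151644, 872, 198, 9907, 13]
print(first_divergence(py_out, cpp_out))  # → 3
```

If the divergence index is 0, the inputs (input_ids or prompt table) likely differ; a divergence deep into the sequence points more toward numerical drift between the runtimes.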
System Info
When using TRT-LLM to run a multimodal model, I found that the results are inconsistent between the Python runtime and the Python-binding C++ runtime (the Python runtime results are correct, while the Python-binding C++ results occasionally contain errors).
For the Python runtime I used TensorRT-LLM/examples/multimodal/run.py; for the Python bindings I tested a service set up with the Triton Inference Server, and also something similar to TensorRT-LLM/examples/bindings/executor/example_basic.py. The model is InternVL2 (vision encoder + Qwen2).
I strictly aligned the prompt-tuning inputs and input_ids, but there are still discrepancies. How should I investigate or resolve this problem?
Thanks~
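Since the prompt-tuning table for InternVL2 is the vision encoder's output, one way to investigate is to dump that tensor from both pipelines and compare them numerically; a preprocessing mismatch shows up as a large difference, while tiny floating-point drift is expected. This is a sketch with synthetic stand-in data — the tolerance, shapes, and hidden size are assumptions, and you would load the tensors you dumped yourself.

```python
import numpy as np

def compare_prompt_tables(a: np.ndarray, b: np.ndarray, atol: float = 1e-3) -> str:
    """Report the max absolute difference between two prompt-tuning tables.

    Small FP drift is expected; a large difference suggests the two
    pipelines are not feeding the model the same vision embeddings.
    """
    if a.shape != b.shape:
        return f"shape mismatch: {a.shape} vs {b.shape}"
    diff = float(np.abs(a - b).max())
    return f"max abs diff {diff:.3e} ({'OK' if diff <= atol else 'SUSPICIOUS'})"

# Synthetic example standing in for the dumped tensors
# (shape is illustrative, not taken from the actual model config).
rng = np.random.default_rng(0)
table = rng.standard_normal((256, 3584), dtype=np.float32)
print(compare_prompt_tables(table, table + 1e-5))
```

Checking the tokenized input_ids the same way (exact equality, since they are integers) quickly narrows the problem down to either the inputs or the runtime itself.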
Who can help?
@byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
I'm not able to provide a reproduction right now; I'm working on it.
Expected behavior
The C++ results should be the same as the Python runtime results
Actual behavior
The C++ results are sometimes inaccurate
Additional notes
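Because the errors are described as occasional, it is worth first ruling out sampling nondeterminism: with greedy decoding, repeated runs on the same input should be bit-identical within a single runtime. The sketch below is generic — `generate` is a stand-in for whichever pipeline (Python runtime, C++ executor, or Triton endpoint) is being tested, not a TensorRT-LLM API.

```python
# Sketch: check that a generation pipeline is deterministic before
# comparing runtimes. `generate` is any callable mapping input token
# IDs to output token IDs.

def is_deterministic(generate, input_ids, runs=5):
    """Run the pipeline several times and check all outputs match."""
    outputs = [tuple(generate(input_ids)) for _ in range(runs)]
    return all(o == outputs[0] for o in outputs)

# Trivial deterministic stand-in for illustration:
print(is_deterministic(lambda ids: [t + 1 for t in ids], [1, 2, 3]))  # → True
```

If both runtimes are individually deterministic but disagree with each other, the mismatch is in the inputs or in runtime-level numerics rather than in sampling.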