[ ] 1. I have searched related issues but cannot get the expected help.
[ ] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I am using PytorchEngineConfig to build a Qwen2-VL Triton service.
I sent the following three requests to the server in sequence:
Request 1 -> Response A
Request 2 -> Response B
Request 3 (same content as Request 1) -> Response B
Since Request 3 is identical to Request 1, it should have received Response A. I suspect that Request 3 was matched to the same prefix as Request 2 and therefore got the same response as Request 2.
Could you please confirm whether PytorchEngine supports prefix caching for visual models, and if so, how it might be affecting the responses?
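For anyone trying to reproduce this, the three-request sequence above can be written as a small client-side check. This is only a sketch: `send_request` is a hypothetical stand-in for whatever call sends a request to the Triton service and returns the generated text (the actual client code is not shown in this report), and the comparison is only meaningful if sampling is deterministic (for example greedy decoding).

```python
from typing import Callable, Dict


def check_repeat_consistency(send_request: Callable[[Dict], str],
                             request_1: Dict,
                             request_2: Dict) -> Dict:
    """Send request_1, request_2, then request_1 again and compare the outputs.

    `send_request` is a hypothetical callable supplied by the caller; it is
    not part of any library API.
    """
    response_a = send_request(request_1)
    response_b = send_request(request_2)
    # Request 3 has the same content as Request 1.
    response_c = send_request(request_1)
    return {
        "responses": [response_a, response_b, response_c],
        # Expected behaviour: the repeated request reproduces Response A.
        "repeat_matches_first": response_c == response_a,
        # Behaviour described in this report: the repeat returns Response B.
        "repeat_matches_second": response_c == response_b,
    }
```

If `repeat_matches_second` is true while `repeat_matches_first` is false, the behaviour matches what is described in this issue.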
Thank you for your assistance.
Reproduction
build triton server:
then send request:
Environment
Error traceback
No response