InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Does PytorchEngine Visual Model Support Prefix Caching? #2789


OftenDream commented 5 days ago


Describe the bug

I am using PytorchEngineConfig to build a Qwen2-VL Triton service. I sent the following three requests to the server in sequence:

  1. Request 1 -> Response A
  2. Request 2 -> Response B
  3. Request 3 (same content as Request 1) -> Response B

Ideally, Request 3 should have received Response A. I suspect that Request 3 was matched against the same cached prefix as Request 2, which produced the same response as Request 2.

Could you please confirm whether PytorchEngine supports prefix caching for visual models, and if so, how it might be affecting the responses?

Thank you for your assistance.
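To make the expectation concrete, here is a tiny sketch of the check; send_request is a hypothetical helper standing in for the Triton client call shown under Reproduction, and greedy sampling is assumed so that identical requests should yield identical text:

    # send_request is hypothetical: it wraps the gRPC streaming call from the
    # Reproduction section below and returns the generated text for one request.
    resp_a = send_request(request_1)   # -> Response A
    resp_b = send_request(request_2)   # -> Response B
    resp_c = send_request(request_3)   # request_3 has the same content as request_1

    # Expected: resp_c == resp_a
    # Observed with enable_prefix_caching=True: resp_c == resp_b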

Reproduction

Build the Triton server:

    from lmdeploy import pipeline, PytorchEngineConfig

    # Build the lmdeploy pipeline with prefix caching enabled (runs inside our Triton backend).
    enable_prefix_caching = True
    engine_config = PytorchEngineConfig(
        tp=tp,
        cache_max_entry_count=cache_max_entry_count,
        enable_prefix_caching=enable_prefix_caching)
    self.engine = pipeline(model_path=model_path,
                           model_name=model_name,
                           backend_config=engine_config,
                           log_level='INFO')

Then send the requests:

    import numpy as np
    import tritonclient.grpc as grpcclient
    from functools import partial

    with grpcclient.InferenceServerClient(self.server_url) as client:
        # Pack the sampling parameters and the multimodal prompt into Triton input tensors.
        inputs = [
            self.prepare_tensor('max_tokens', np.array([max_tokens], dtype=np.int32)),
            self.prepare_tensor('temperature', np.array([temperature], dtype=np.float32)),
            self.prepare_tensor('top_p', np.array([top_p], dtype=np.float32)),
            self.prepare_tensor('top_k', np.array([top_k], dtype=np.int32)),
            self.prepare_tensor('stream', np.array([stream], dtype=np.bool_)),
            self.prepare_tensor('messages', np.array([prompt], dtype=np.object_)),
            self.prepare_tensor('text', np.array([text], dtype=np.object_)),
            self.prepare_tensor('ignore_eos', np.array([ig_eos], dtype=np.bool_))
        ]
        # Stream the generation back through the callback while the client is still open.
        client.start_stream(partial(self.stream_callback))
        client.async_stream_infer(self.model_name, inputs,
                                  sequence_start=True, sequence_end=True)
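As a side note, the same comparison can be made without Triton by driving the pipeline directly; this is a minimal sketch, assuming a placeholder model path and image URL, with top_k=1 so that identical requests should produce identical text:

    from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline
    from lmdeploy.vl import load_image

    # Same engine settings as the Triton server above, but driven directly in Python.
    pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct',  # placeholder model path
                    backend_config=PytorchEngineConfig(enable_prefix_caching=True))

    image = load_image('https://example.com/sample.jpg')  # placeholder image URL
    gen = GenerationConfig(top_k=1)  # greedy decoding for a deterministic comparison

    r1 = pipe(('Describe this image.', image), gen_config=gen)      # Request 1
    r2 = pipe(('What objects are visible?', image), gen_config=gen)  # Request 2
    r3 = pipe(('Describe this image.', image), gen_config=gen)      # Request 3 == Request 1

    # Expected: r3.text == r1.text; if r3.text matches r2.text instead,
    # the prefix cache in PytorchEngine is the likely culprit.
    print(r3.text == r1.text, r3.text == r2.text)

If the mismatch reproduces here as well, the Triton wrapper can be ruled out.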

Environment

cuda=11.8
lmdeploy=0.6.2
torch=2.4.0

Error traceback

No response

grimoire commented 4 days ago

Nope, we don't have a good solution for matching multimodal features yet. Prefix caching is not supported for any VL model.
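For now, the workaround is to leave prefix caching disabled for VL models; a minimal sketch, keeping the other engine settings from the reproduction above:

    # Workaround sketch: disable prefix caching when the model is a VL model.
    engine_config = PytorchEngineConfig(
        tp=tp,
        cache_max_entry_count=cache_max_entry_count,
        enable_prefix_caching=False)  # prefix caching is unsupported for VL models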