InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Does PytorchEngine Visual Model Support Prefix Caching? #2789


OftenDream commented 5 days ago


Describe the bug

I am using PytorchEngineConfig to build a Qwen2-VL Triton service. I sent the following three requests to the server in sequence:

  1. Request 1 -> Response A
  2. Request 2 -> Response B
  3. Request 3 (same content as Request 1) -> Response B

Ideally, Request 3 should have received Response A. I suspect that Request 3 was matched against the same cached prefix as Request 2, which produced the same response as Request 2.

Could you please confirm whether PytorchEngine supports prefix caching for visual models, and if so, how it might be affecting the responses?

Thank you for your assistance.
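To make the expectation concrete, here is a tiny sketch of the check; send_request is a hypothetical helper standing in for the Triton client call shown under Reproduction, and greedy sampling is assumed so that identical requests should yield identical text:

    # send_request is hypothetical: it wraps the gRPC streaming call from the
    # Reproduction section below and returns the generated text for one request.
    resp_a = send_request(request_1)   # -> Response A
    resp_b = send_request(request_2)   # -> Response B
    resp_c = send_request(request_3)   # request_3 has the same content as request_1

    # Expected: resp_c == resp_a
    # Observed with enable_prefix_caching=True: resp_c == resp_b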

Reproduction

Build the Triton server:

    from lmdeploy import pipeline, PytorchEngineConfig

    # Build the lmdeploy pipeline with prefix caching enabled (runs inside our Triton backend).
    enable_prefix_caching = True
    engine_config = PytorchEngineConfig(
        tp=tp,
        cache_max_entry_count=cache_max_entry_count,
        enable_prefix_caching=enable_prefix_caching)
    self.engine = pipeline(model_path=model_path,
                           model_name=model_name,
                           backend_config=engine_config,
                           log_level='INFO')

Then send the requests:

    import numpy as np
    import tritonclient.grpc as grpcclient
    from functools import partial

    with grpcclient.InferenceServerClient(self.server_url) as client:
        # Pack the sampling parameters and the multimodal prompt into Triton input tensors.
        inputs = [
            self.prepare_tensor('max_tokens', np.array([max_tokens], dtype=np.int32)),
            self.prepare_tensor('temperature', np.array([temperature], dtype=np.float32)),
            self.prepare_tensor('top_p', np.array([top_p], dtype=np.float32)),
            self.prepare_tensor('top_k', np.array([top_k], dtype=np.int32)),
            self.prepare_tensor('stream', np.array([stream], dtype=np.bool_)),
            self.prepare_tensor('messages', np.array([prompt], dtype=np.object_)),
            self.prepare_tensor('text', np.array([text], dtype=np.object_)),
            self.prepare_tensor('ignore_eos', np.array([ig_eos], dtype=np.bool_))
        ]
        # Stream the generation back through the callback while the client is still open.
        client.start_stream(partial(self.stream_callback))
        client.async_stream_infer(self.model_name, inputs,
                                  sequence_start=True, sequence_end=True)
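As a side note, the same comparison can be made without Triton by driving the pipeline directly; this is a minimal sketch, assuming a placeholder model path and image URL, with top_k=1 so that identical requests should produce identical text:

    from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline
    from lmdeploy.vl import load_image

    # Same engine settings as the Triton server above, but driven directly in Python.
    pipe = pipeline('Qwen/Qwen2-VL-7B-Instruct',  # placeholder model path
                    backend_config=PytorchEngineConfig(enable_prefix_caching=True))

    image = load_image('https://example.com/sample.jpg')  # placeholder image URL
    gen = GenerationConfig(top_k=1)  # greedy decoding for a deterministic comparison

    r1 = pipe(('Describe this image.', image), gen_config=gen)      # Request 1
    r2 = pipe(('What objects are visible?', image), gen_config=gen)  # Request 2
    r3 = pipe(('Describe this image.', image), gen_config=gen)      # Request 3 == Request 1

    # Expected: r3.text == r1.text; if r3.text matches r2.text instead,
    # the prefix cache in PytorchEngine is the likely culprit.
    print(r3.text == r1.text, r3.text == r2.text)

If the mismatch reproduces here as well, the Triton wrapper can be ruled out.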

Environment

cuda=11.8
lmdeploy=0.6.2
torch=2.4.0

Error traceback

No response

grimoire commented 4 days ago

Nope, we don't have a good solution for matching multimodal features yet. Prefix caching is not supported for any VL model.
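For now, the workaround is to leave prefix caching disabled for VL models; a minimal sketch, keeping the other engine settings from the reproduction above:

    # Workaround sketch: disable prefix caching when the model is a VL model.
    engine_config = PytorchEngineConfig(
        tp=tp,
        cache_max_entry_count=cache_max_entry_count,
        enable_prefix_caching=False)  # prefix caching is unsupported for VL models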