This is a problem with chat templates.
What LMCache sees is the "assistant reply" produced by decoding, appended immediately after the user's prompt. But with a chat template, the next turn of the conversation may insert tokens before those decoded tokens, making the prefix differ.
For instance, the chat template for mistralai/Mistral-7B-Instruct-v0.2 contains {%- elif message['role'] == 'assistant' %} {{- ' ' + message['content'] + eos_token }}, which inserts one token before the assistant reply (the decoded tokens).
Here is an example.
If the user prompts "Say Hello", the first rendered prompt is "<s>[INST]Say Hello[/INST]",
and LMCache sees "<s>[INST]Say Hello[/INST]Hello",
but in the next turn, the rendered prompt can be "<s>[INST]Say Hello[/INST] Hello" due to the chat template.
The extra space between "[/INST]" and "Hello" causes a prefix mismatch.
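The mismatch can be sketched with a minimal stand-in for the template. Note `render_prompt` is a hypothetical, heavily simplified rendition of the Mistral-style Jinja template (the real one also handles system prompts and token boundaries), used only to show where the extra space appears:

```python
def render_prompt(messages):
    # Simplified sketch of a Mistral-style chat template (assumption:
    # not the real Jinja template, just enough to show the inserted space).
    out = "<s>"
    for m in messages:
        if m["role"] == "user":
            out += "[INST]" + m["content"] + "[/INST]"
        elif m["role"] == "assistant":
            # The template prepends ' ' to the assistant content.
            out += " " + m["content"] + "</s>"
    return out

# What LMCache caches after turn 1: the prompt plus the raw decoded reply.
cached = "<s>[INST]Say Hello[/INST]" + "Hello"

# What the template renders at the start of turn 2.
rendered = render_prompt([
    {"role": "user", "content": "Say Hello"},
    {"role": "assistant", "content": "Hello"},
])

print(cached)                       # <s>[INST]Say Hello[/INST]Hello
print(rendered)                     # <s>[INST]Say Hello[/INST] Hello</s>
print(rendered.startswith(cached))  # False: the space breaks the prefix
```

Because the rendered second-turn prompt is not an extension of the cached string, the cached KV entries cannot be reused via prefix matching.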