LMCache / LMCache

Ultra-Fast and Cheaper Long-Context LLM Inference
https://lmcache.ai/
Apache License 2.0

Support saving decode KV Cache for more models (better if there's a generic solution) #151

Open YaoJiayi opened 1 month ago

XbzOnGit commented 3 weeks ago

This is a problem with chat templates.
What LMCache sees is the assistant reply generated by decoding, which immediately follows the user's prompt. But with a chat template, it is possible that in the next turn of the conversation some tokens are inserted before the decoded tokens, making the prefix differ.
For instance, for mistralai/Mistral-7B-Instruct-v0.2, the template fragment `{%- elif message['role'] == 'assistant' %} {{- ' ' + message['content'] + eos_token }}` inserts a space before the assistant reply (the decoded tokens).

Here is an example.

If the user prompt is "Say Hello", the first rendered prompt is "<s>[INST]Say Hello[/INST]",
so LMCache sees "<s>[INST]Say Hello[/INST]Hello" (prompt plus the decoded reply "Hello").
But in the next turn, the chat template renders the history as "<s>[INST]Say Hello[/INST] Hello".
The extra space between "[/INST]" and "Hello" causes a mismatch in the prefix.
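
For reference, here is a minimal sketch (not LMCache code, and only an approximation of what LMCache actually stores) that reproduces the mismatch with `apply_chat_template` from transformers. It assumes you have access to the Mistral tokenizer; the exact rendered strings depend on the chat template shipped with that tokenizer version.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Turn 1: the engine's view is the rendered prompt plus the decoded reply.
# (Re-encoding the concatenated string only approximates the token IDs
# LMCache actually caches, but it is enough to show the mismatch.)
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Say Hello"}],
    tokenize=False,
)
decoded_reply = "Hello"
cached_tokens = tok.encode(prompt + decoded_reply, add_special_tokens=False)

# Turn 2: the chat template re-renders the whole history and inserts " "
# before the assistant reply, so the new prompt no longer starts with the
# cached token sequence.
next_prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Say Hello"},
     {"role": "assistant", "content": "Hello"}],
    tokenize=False,
)
next_tokens = tok.encode(next_prompt, add_special_tokens=False)

# Count how many leading tokens still match.
shared = 0
for a, b in zip(cached_tokens, next_tokens):
    if a != b:
        break
    shared += 1
print(f"shared prefix: {shared} of {len(cached_tokens)} cached tokens")
```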