continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains.
https://docs.continue.dev/
Apache License 2.0

Chat is very slow when using with the llama.cpp server #2845

Open · lehuythangit opened this issue 1 week ago

lehuythangit commented 1 week ago


Problem

Chat becomes very slow with the llama.cpp server as the conversation grows, because `cache_prompt = true` is missing from Continue's calls to llama.cpp's `/completion` API. As a result, llama.cpp re-processes the entire previous message history on every prompt instead of serving it from the cache.
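For context, llama.cpp's `/completion` endpoint accepts a `cache_prompt` flag. A minimal sketch of a direct request with it enabled (the server URL, prompt, and `n_predict` value are placeholders):

```typescript
// Minimal sketch: direct request to a llama.cpp server's /completion
// endpoint with cache_prompt enabled, so the server reuses the KV
// cache for the unchanged prefix of the prompt (the earlier chat
// history) instead of re-evaluating it on every turn.
async function complete(prompt: string): Promise<string> {
  const response = await fetch("http://localhost:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt,             // full rendered chat history + new message
      n_predict: 512,     // max tokens to generate
      cache_prompt: true, // reuse the cached prompt prefix
    }),
  });
  const data = await response.json();
  return data.content;
}
```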

Solution

Please add the `cache_prompt = true` property to the `/completion` API call, or expose it as a configuration option in `config.json`.
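If this were exposed through configuration, it might look something like the following in `config.json`. This is only a hypothetical shape: the `cache_prompt` entry under `completionOptions` does not exist today and the final property name would be up to the maintainers; the rest follows the documented llama.cpp provider setup.

```json
{
  "models": [
    {
      "title": "Llama.cpp",
      "provider": "llama.cpp",
      "model": "MODEL_NAME",
      "apiBase": "http://localhost:8080",
      "completionOptions": {
        "cache_prompt": true
      }
    }
  ]
}
```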

Thanks

lehuythangit commented 1 week ago

As a workaround, I modified `.vscode\extensions\continue.continue-0.8.55-win32-x64\out\extension.js`:

[screenshot: the modified extension.js]
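The screenshot is not reproduced here, but the gist of the patch is a one-line addition. A sketch of the kind of change, assuming the bundled extension.js assembles the llama.cpp request body in one place (the real file is bundled and minified, so the variable names below are placeholders):

```typescript
// Hypothetical sketch of the workaround inside the bundled
// extension.js: wherever the llama.cpp /completion request body is
// assembled, add cache_prompt so the server reuses its prompt cache.
const body = {
  prompt: renderedPrompt, // placeholder name
  ...completionOptions,   // placeholder name
  cache_prompt: true,     // added by the workaround
};
```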