abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Question: How to keep longer conversations without breaking the code? #1139

Open · zah-tane opened this issue 7 months ago

zah-tane commented 7 months ago

I want to have longer conversations with the model, but as I understand it, the number of tokens in the prompt plus the number of tokens generated by the model must stay below the context size (n_ctx). As the conversation develops, more tokens accumulate in messages, which eventually results in the following error:

ValueError: Requested tokens (2083) exceed context window of 2048
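For reference, here is a rough sketch of the kind of loop I mean (the model path and prompts are just placeholders); once messages grows enough, the call raises the error above:

```python
from llama_cpp import Llama

# Placeholder model path; n_ctx matches the error above.
llm = Llama(model_path="./model.gguf", n_ctx=2048)

messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    messages.append({"role": "user", "content": input("> ")})
    # Once the accumulated conversation plus the requested completion no
    # longer fits within n_ctx, this call raises the ValueError shown above.
    reply = llm.create_chat_completion(messages=messages, max_tokens=256)
    assistant = reply["choices"][0]["message"]["content"]
    print(assistant)
    messages.append({"role": "assistant", "content": assistant})
```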

I am new to all this, but I think llama.cpp handles this with the n_keep parameter (here is a helpful discussion). However, I can't find a way to avoid the above error using llama-cpp-python. I also think the chat context has to be reset at some point, because even with n_keep (which I don't know how to use from this repo), the context window would still eventually fill up.

So, my question is: how can I handle longer chats with create_chat_completion and avoid the above error, using something similar to n_keep combined with resetting total_tokens and emptying the current context when it fills up?

Thank you!

abetlen commented 7 months ago

Hey @zah-tane, you're correct that llama.cpp can do this by shifting the KV cache. It isn't implemented in the Python API yet. I believe #1106 does this, though I haven't had a chance to review / test that PR.
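For context, the idea in llama.cpp's main example is roughly: keep the first n_keep tokens, discard the oldest half of what remains, and shift the rest down before continuing to generate. A token-level sketch of that idea (plain Python, not the llama-cpp-python API):

```python
# Conceptual sketch only -- this mirrors what llama.cpp's context shift does
# to the KV cache; it is not a function exposed by llama-cpp-python.
def shift_context(tokens, n_keep):
    kept = tokens[:n_keep]          # e.g. the system prompt
    rest = tokens[n_keep:]
    n_discard = len(rest) // 2      # drop the oldest half of the remainder
    return kept + rest[n_discard:]
```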

zah-tane commented 7 months ago

@abetlen Thank you for your comment. I will try to update this thread if I get the chance to test #1106.

fat-tire commented 3 months ago

Any progress here? Or is it still the case that it crashes when messages gets too long? I would think the ideal behaviour would be to keep the system prompt if it exists, then append to the bottom and chop off the top after the system prompt, no?
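As a stopgap, something like this (untested sketch; it assumes messages[0] is the system prompt and that the overflow surfaces as the ValueError above) would do roughly that by catching the error and dropping the oldest turn before retrying:

```python
def chat_with_trimming(llm, messages, **kwargs):
    # Illustrative helper, not part of the library: retry on context
    # overflow, dropping the oldest non-system message each time while
    # keeping messages[0] (the system prompt) intact.
    while True:
        try:
            return llm.create_chat_completion(messages=messages, **kwargs)
        except ValueError:
            if len(messages) <= 2:
                raise  # nothing left to drop
            del messages[1]
```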

I suppose a workaround might be to keep a token count for every entry in messages (including the various markers and delimiters) and do a round-robin sort of thing, so that as messages approaches the token maximum it starts popping prompts from the top, leaving room for the reply as well. Hmm. I wonder if this is the best solution...?
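A proactive version of that round-robin idea might look something like this (untested sketch; the model path, TEMPLATE_OVERHEAD allowance, and max_tokens value are all placeholder guesses):

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf", n_ctx=2048)  # placeholder path

MAX_NEW_TOKENS = 256
TEMPLATE_OVERHEAD = 16  # rough allowance for the chat template's role markers


def count_tokens(msg):
    # Approximate per-message cost: content tokens plus template overhead.
    return len(llm.tokenize(msg["content"].encode("utf-8"), add_bos=False)) + TEMPLATE_OVERHEAD


def trim_to_budget(messages):
    # Keep the system prompt at index 0; pop the oldest turns after it
    # until the estimated prompt leaves room for the reply.
    budget = llm.n_ctx() - MAX_NEW_TOKENS
    while len(messages) > 2 and sum(count_tokens(m) for m in messages) > budget:
        messages.pop(1)
    return messages


# Usage inside a chat loop:
#   messages = trim_to_budget(messages)
#   reply = llm.create_chat_completion(messages=messages, max_tokens=MAX_NEW_TOKENS)
```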