Open zah-tane opened 7 months ago
Hey @zah-tane you're correct that llama.cpp can do this by shifting the kv cache. It isn't implemented in the Python API yet. I believe #1106 does this, though I haven't had a chance to review / test that PR yet.
@abetlen Thank you for your comment. I will try to update this thread if I get the chance to test #1106.
Any progress here? Or is it still the case that there's a crash when `messages` gets too long? I would think the ideal situation would be to keep the system prompt if it exists, then append to the bottom and chop off the top after the system prompt, no?
I suppose one workaround might be to keep a token count of every entry in `messages`, including the various markers and delimiters, and do a round-robin sort of thing so that as `messages` approaches the token maximum, it starts popping prompts from the top, leaving room for the reply as well. Hmm. Is this the best solution, I wonder?
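The round-robin idea above could be sketched roughly like this: keep the system prompt, then drop the oldest user/assistant turns until the conversation fits a token budget. Note that `count_tokens` here is only a whitespace-split placeholder; real code should count with the model's own tokenizer (e.g. `Llama.tokenize` in llama-cpp-python), and `reserve_for_reply` is an assumed knob, not an existing parameter.

```python
def count_tokens(text):
    # Placeholder tokenizer: a real implementation should use the model's
    # tokenizer (e.g. llm.tokenize(text.encode("utf-8"))) instead.
    return len(text.split())

def trim_messages(messages, max_tokens, reserve_for_reply=256):
    """Keep the system prompt, then keep as many of the most recent
    messages as fit within max_tokens - reserve_for_reply."""
    budget = max_tokens - reserve_for_reply
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    used = sum(count_tokens(m["content"]) for m in system)
    kept = []
    # Walk newest-to-oldest so recent turns survive; older ones get popped.
    for m in reversed(rest):
        cost = count_tokens(m["content"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

The trimmed list can then be passed to `create_chat_completion` in place of the full history; the budget should also account for the chat template's markers and delimiters, which this sketch ignores.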
I want to have longer conversations with the model, but as I understand it, the number of tokens in the prompt messages plus the number of tokens generated by the model must be less than the context size (`n_ctx`). As the conversation develops, more tokens are stored in `messages`, which eventually results in the following error: `ValueError: Requested tokens (2083) exceed context window of 2048`
I am new to all this, but I think the way it is handled in `llama.cpp` is via the `n_keep` parameter (here is a helpful discussion). However, I can't find how to avoid the mentioned error using `llama-cpp-python`. I think the chat context should be reset at some point, because even if we use `n_keep` (which I don't know how to do with this repo), the context window would still fill up.

So, my question is: how can I handle longer chats when using `create_chat_completion` and avoid the mentioned error, using some functionality similar to `n_keep` combined with resetting `total_tokens` and emptying the current context when it fills up?

Thank you!