leszekhanusz opened this issue 1 year ago (Open)
Just to add, this is not a problem with llama.cpp itself: I can have very long conversations with llama.cpp in interactive mode. Also, I ran into this in a situation where the context size wasn't anywhere near 2048; it just plainly refused to generate more tokens.
So it seems other people are reporting the issue via Ooba in #331. I attempted to reproduce directly in llama-cpp-python, but couldn't.
Having the same issue
Describe exactly how this happened to you.
Using a Matrix bot that's hooked up to the oobabooga textgen via llama-cpp-python. It seems to start throwing the error after only a few messages.
I'm trying to make long stories using a llama.cpp model (`guanaco-33B.ggmlv3.q4_0.bin` in my case) with oobabooga/text-generation-webui. It works for short inputs, but it stops working once the number of input tokens comes close to the context size (2048).
With a bit of playing with the webui (you can count input tokens and modify `max_new_tokens` on the main page) I found out that the behavior is like this:

- if `nb_input_tokens + max_new_tokens < context_size`, then it works correctly.
- if `nb_input_tokens < context_size` but `nb_input_tokens + max_new_tokens > context_size`, then it fails silently, generating 0 tokens.
- if `nb_input_tokens > context_size`, then it fails with an error.
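In the meantime, callers can guard against the silent-failure case themselves. This is only a minimal sketch, assuming you load the model directly with llama-cpp-python (the model path and the `safe_completion` helper are hypothetical, not part of the library): count the prompt tokens with `Llama.tokenize` and cap `max_tokens` so the total stays within `n_ctx`.

```python
from llama_cpp import Llama

N_CTX = 2048

# Hypothetical model path, for illustration only.
llm = Llama(model_path="./guanaco-33B.ggmlv3.q4_0.bin", n_ctx=N_CTX)

def safe_completion(prompt: str, max_new_tokens: int = 256):
    # Count the prompt tokens the same way the library will.
    n_input = len(llm.tokenize(prompt.encode("utf-8")))
    # nb_input_tokens + max_tokens must stay within the context window,
    # otherwise generation fails silently (0 tokens generated).
    budget = N_CTX - n_input
    if budget <= 0:
        raise ValueError(f"prompt uses {n_input} tokens; context is only {N_CTX}")
    return llm(prompt, max_tokens=min(max_new_tokens, budget))
```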
I've seen issue #92 of llama-cpp-python, but it is closed and I'm on a recent version of llama-cpp-python (release 0.1.57).

llama-cpp-python should probably discard some input tokens at the beginning to be able to fit inside the context and allow us to continue long stories; a rough client-side version of that idea is sketched below.
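Until something like that lands in the library, the same idea can be approximated on the caller's side. A rough sketch (the `truncate_prompt` helper is made up for illustration; cutting at an arbitrary token boundary can split a word, and the `detokenize`/`decode` round-trip is only approximate):

```python
def truncate_prompt(llm, prompt: str, max_new_tokens: int, n_ctx: int = 2048) -> str:
    """Drop tokens from the start of the prompt so that the prompt
    plus max_new_tokens fits inside the context window."""
    tokens = llm.tokenize(prompt.encode("utf-8"))
    keep = n_ctx - max_new_tokens
    if len(tokens) > keep:
        # Discard the oldest tokens, keeping the most recent ones.
        tokens = tokens[-keep:]
    # detokenize returns bytes; decoding may lose a partial multi-byte
    # character at the cut point, hence errors="ignore".
    return llm.detokenize(tokens).decode("utf-8", errors="ignore")
```

With that, a long story can be continued with e.g. `llm(truncate_prompt(llm, story_so_far, 200), max_tokens=200)`, at the cost of the model forgetting the oldest part of the story. A smarter version would cut on paragraph or message boundaries so the remaining context stays coherent.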