abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io

Working with long stories #307

Open leszekhanusz opened 1 year ago

leszekhanusz commented 1 year ago

I'm trying to generate long stories using a llama.cpp model (guanaco-33B.ggmlv3.q4_0.bin in my case) with oobabooga/text-generation-webui.

It works for short inputs, but it stops working once the number of input tokens approaches the context size (2048).

After playing with the webui a bit (you can count input tokens and modify `max_new_tokens` on the main page), I found that the behavior is as follows (a sketch reproducing these cases follows the error output below):

- if `nb_input_tokens + max_new_tokens < context_size`, it works correctly;
- if `nb_input_tokens < context_size` but `nb_input_tokens + max_new_tokens > context_size`, it fails silently, generating 0 tokens:

Output generated in 0.25 seconds (0.00 tokens/s, 0 tokens, ...

- if `nb_input_tokens > context_size`, it fails with:

llama_tokenize: too many tokens
llama_tokenize: too many tokens
llama_tokenize: too many tokens
Output generated in 0.28 seconds (0.00 tokens/s, 0 tokens, ...
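For reference, here is a minimal sketch of the three cases above using llama-cpp-python's high-level API (the model path and the numbers are just my setup; `Llama.tokenize()` expects bytes):

```python
from llama_cpp import Llama

context_size = 2048
llm = Llama(model_path="guanaco-33B.ggmlv3.q4_0.bin", n_ctx=context_size)

prompt = "Once upon a time..."  # imagine a much longer story here
max_new_tokens = 512

# Llama.tokenize() takes bytes and returns a list of token ids
nb_input_tokens = len(llm.tokenize(prompt.encode("utf-8")))

if nb_input_tokens + max_new_tokens < context_size:
    print("works correctly")
elif nb_input_tokens < context_size:
    print("fails silently, generating 0 tokens")
else:
    print("fails with 'llama_tokenize: too many tokens'")
```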

I've seen issue #92 of llama-cpp-python, but it is closed and I'm on a recent version of llama-cpp-python (release 0.1.57).

llama-cpp-python should probably discard some input tokens at the beginning of the prompt so that the input fits inside the context, allowing us to continue long stories. Something like the sketch below is what I have in mind.
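A hypothetical helper, continuing from the sketch above (`truncate_prompt` is my own name, not a library function): it drops the oldest tokens so that the input plus the generation budget fits in the context window.

```python
def truncate_prompt(llm: Llama, prompt: str, max_new_tokens: int,
                    context_size: int = 2048) -> str:
    # Keep only as many tokens as fit alongside the generation budget,
    # discarding the oldest part of the story.
    tokens = llm.tokenize(prompt.encode("utf-8"))
    budget = context_size - max_new_tokens - 1  # small slack for the BOS token
    if len(tokens) > budget:
        tokens = tokens[-budget:]  # keep the most recent part of the story
    return llm.detokenize(tokens).decode("utf-8", errors="ignore")

long_story = "Once upon a time... " * 500  # stand-in for a real long story
safe_prompt = truncate_prompt(llm, long_story, max_new_tokens=512)
output = llm(safe_prompt, max_tokens=512)
```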

agronholm commented 1 year ago

Just to add, this is not a problem with llama.cpp itself; I can have very long conversations with llama.cpp in interactive mode. Also, I ran into this in a situation where the token count was nowhere near the 2048 context size. It simply refused to generate more tokens.

gjmulder commented 1 year ago

It seems other people are reporting the issue via Ooba in #331. I attempted to reproduce it directly in llama-cpp-python, but couldn't.

dillfrescott commented 1 year ago

Having the same issue

agronholm commented 1 year ago

> Having the same issue

Describe exactly how this happened to you.

dillfrescott commented 1 year ago

I'm using a Matrix bot that's hooked up to the oobabooga text-generation webui, which uses llama-cpp-python. It starts throwing the error after only a few messages.