Closed: janekb04 closed this issue 5 months ago
This seems to be due to context swapping. llama.cpp's context limit is 2048 tokens; after that, it performs a "context swap":
// infinite text generation via context swapping
// if we run out of context:
// - take the n_keep first tokens from the original prompt (via n_past)
// - take half of the last (n_ctx - n_keep) tokens and recompute the logits in a batch
https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#L256
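A minimal sketch of the swap step described in those comments (a hypothetical helper, not the actual llama.cpp code; `n_ctx` and `n_keep` mirror the variable names in `main.cpp`, and tokens are plain ints here):

```cpp
#include <cassert>
#include <vector>

// When the context of size n_ctx is full: keep the first n_keep
// tokens, drop the older half of the remaining (n_ctx - n_keep)
// tokens, and keep the recent half. In llama.cpp the kept suffix
// then has to be re-evaluated in a batch, which is the expensive
// part that shows up as the stall.
std::vector<int> swap_context(const std::vector<int>& tokens,
                              int n_ctx, int n_keep) {
    if ((int)tokens.size() < n_ctx) return tokens;  // not full yet
    const int n_left    = n_ctx - n_keep;
    const int n_discard = n_left / 2;
    std::vector<int> out(tokens.begin(), tokens.begin() + n_keep);
    out.insert(out.end(),
               tokens.begin() + n_keep + n_discard, tokens.end());
    return out;
}
```

With n_ctx = 2048 and n_keep = 4, a full context shrinks to 1026 tokens: the 4 kept prompt tokens plus the most recent 1022 tokens of history.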
@ngxson This makes sense. The swap is invisible in, say, ChatGPT because there this context recreation happens only after the model has finished writing, when it's the user's turn.
Would it make sense to track how full the context is in interactive mode, so that we could swap the context (or in this case clear part of it) while the user is typing the next question?
It could also work like ChatGPT. There, the context is recreated every time the user sends a message: the tokens in the message are counted, the maximum response length is added to that, and then as much history as fits is prepended. Though I don't know what that would mean for performance, as context recreation seems rather expensive in llama.cpp.
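The ChatGPT-style scheme described above could be sketched like this (a hypothetical illustration; `build_prompt` and its parameters are invented for the example, with messages as pre-tokenized int vectors):

```cpp
#include <cassert>
#include <vector>

// Reserve room for the new message and the maximum response length,
// then prepend as many of the most recent history messages as still
// fit in the context window. Older messages fall off first.
std::vector<int> build_prompt(
        const std::vector<std::vector<int>>& history,
        const std::vector<int>& new_msg,
        int n_ctx, int n_response_max) {
    int budget = n_ctx - n_response_max - (int)new_msg.size();
    std::vector<std::vector<int>> kept;
    // Walk history newest-first, keeping whole messages while they fit.
    for (auto it = history.rbegin(); it != history.rend(); ++it) {
        if ((int)it->size() > budget) break;
        budget -= (int)it->size();
        kept.push_back(*it);
    }
    // Emit kept messages back in chronological order, then the new one.
    std::vector<int> prompt;
    for (auto it = kept.rbegin(); it != kept.rend(); ++it)
        prompt.insert(prompt.end(), it->begin(), it->end());
    prompt.insert(prompt.end(), new_msg.begin(), new_msg.end());
    return prompt;
}
```

The cost concern is that the whole resulting prompt has to be re-evaluated from scratch on every user message, rather than only when the context actually overflows.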
Here, I think a better solution would be to recreate the context as soon as LLaMA stops typing. We would assume that the user's query plus LLaMA's response must not exceed a certain limit.
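The trigger for that proactive scheme could be as simple as this check after each model turn (a hypothetical sketch; `should_swap_now` and the per-exchange budget are assumptions, not llama.cpp API):

```cpp
#include <cassert>

// After the model finishes its turn, swap the context early if the
// remaining room cannot hold one assumed exchange (user query plus
// model response). The re-evaluation then happens while the user is
// typing instead of mid-generation.
bool should_swap_now(int n_past, int n_ctx, int n_exchange_budget) {
    return n_ctx - n_past < n_exchange_budget;
}
```

For example, with n_ctx = 2048 and a budget of 256 tokens per exchange, the swap would fire once more than 1792 tokens of context are in use.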
This issue was closed because it has been inactive for 14 days since being marked as stale.
Expected Behavior
Tokens are generated at an approximately constant rate, i.e. N tokens per second on a given machine.
Current Behavior
Sometimes, the LLM takes much longer than usual to generate a token; it can be a 10x slowdown.
Environment and Context
Setup: MacBook Pro 14-inch (2021), 10-core Apple M1 Pro CPU, 16 GB RAM
OS: macOS Ventura 13.3 (22E252)
clang --version
Steps to Reproduce
Run
./main -m ./models/ggml-vicuna-7b-4bit-rev1.bin -n 512 --color -f prompts/chat-with-vicuna.txt --seed 42 --mlock
The model will get stuck after "of":
...or visit one of▏
the city's many restaurants...
Failure Logs
Video