Open m9e opened 1 month ago
This is normal behaviour: with a small context size (such as 512), the server will automatically discard past tokens when the context becomes full. With instruction-tuned models such as the one you are using, this can become catastrophic, because the chat template likely gets destroyed and the model goes out of distribution (OOD). Using --ctx-size 0 will give you optimal behaviour, utilizing the maximum context supported by the model.
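For anyone else hitting this, here is a rough sketch of what that shift amounts to. This is not the server's actual code; the variable names simply mirror the fields of the "slot context shift" log line quoted later in this issue, and the halve-the-remainder rule is an assumption read off those numbers.

```cpp
// Simplified sketch of a context shift (an assumption, not the server's exact code).
#include <cstdio>
#include <vector>

int main() {
    const int n_ctx  = 512; // the default --ctx-size the reporter ran with
    const int n_keep = 0;   // tokens pinned at the start of the context
    int       n_past = 511; // tokens currently held in the KV cache

    std::vector<int> cache(n_past, 0); // stand-in for the cached tokens

    if (n_past + 1 >= n_ctx) {
        // context is full: drop roughly half of the non-pinned tokens
        const int n_left    = n_past - n_keep; // 511
        const int n_discard = n_left / 2;      // 255

        // the discarded span is the oldest part of the conversation, so for an
        // instruction-tuned model the chat template gets cut apart -> incoherence
        cache.erase(cache.begin() + n_keep, cache.begin() + n_keep + n_discard);
        n_past -= n_discard;

        std::printf("n_left=%d n_discard=%d n_past=%d n_cache_tokens=%zu\n",
                    n_left, n_discard, n_past, cache.size());
    }

    // With --ctx-size 0 the context is sized to the model's full training context,
    // so this point is reached much later, or not at all for short chats.
    return 0;
}
```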
I understand the point there. Would there be downsides to making --ctx-size 0 the default when loading an instruction-tuned model? Or, on any given generation, setting -n to (max_ctx - input_ctx), i.e. "auto max new tokens" behaviour (a rough sketch of that idea follows this comment)? Or having a flag for that?
Or:
Just a bunch of thoughts. Obviously I was being ignorant here (in my mind I was conflating app-level context flushing via truncation etc. with flushing at the model level, which is obviously much more painful to the output!), but the expected behaviour here feels like a bit of a trap that could be avoided. Then again, I'm not sure whether I'm just missing the downsides.
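On the "auto max new tokens" idea above, here is a hedged sketch of what that computation might look like. This is hypothetical (no such flag appears in the commands in this thread), and the variable names are made up for illustration:

```cpp
// Hypothetical "auto max new tokens": cap the generation budget at whatever
// room is left in the context after the formatted prompt.
#include <algorithm>
#include <cstdio>

int main() {
    const int n_ctx           = 8192; // context size the server was started with
    const int n_prompt_tokens = 1200; // tokens in the formatted chat prompt
    const int n_requested     = -1;   // negative = "no explicit limit"

    const int room      = std::max(0, n_ctx - n_prompt_tokens);
    const int n_predict = n_requested < 0 ? room : std::min(n_requested, room);

    std::printf("generate at most %d new tokens so the context never overflows\n",
                n_predict);
    return 0;
}
```

The point is only that the cap can be derived from values the server already knows at request time.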
Yes, it can be improved. I will try to address similar issues within #7675.
This issue seems to be related to mine: https://github.com/ggerganov/llama.cpp/issues/7929#issue-2352272658. With today's version the problem with garbage output seems to be gone, and everything works as in the B3080 version except for the context window. Before, when the output reached the context window size, it would just reset and continue answering questions forever; now, once the context window is filled with output from multiple questions, generation simply stops. Is there a way to free the context window automatically after it gets filled?
Here is how I run it:
llama.cpp/llama-cli --model ../../models/meta-llama-3-8b-instruct_q5_k_s.gguf --n-gpu-layers 35 -cnv --interactive-first --simple-io --interactive -b 2048 --ctx_size 4096 --temp 0.3 --top_k 10 --multiline-input --repeat_penalty 1.12 -t 6 --chat-template llama3
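On the question of freeing the context window automatically: I am not aware of a llama-cli flag for it, but if you are embedding libllama directly, the public API does expose KV-cache management, so an application can drop cached tokens between questions itself. A minimal sketch, assuming the llama.h API of this era (error handling and the generation loop omitted):

```cpp
#include "llama.h"

// Wipe the whole KV cache so the next prompt starts from an empty context.
void reset_between_questions(llama_context * ctx) {
    llama_kv_cache_clear(ctx);
}

// Or remove only a range of positions for sequence 0 (e.g. older turns),
// keeping a pinned prefix such as the system prompt at the front.
void drop_old_turns(llama_context * ctx, llama_pos keep_until, llama_pos end) {
    llama_kv_cache_seq_rm(ctx, 0, keep_until, end);
}
```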
What happened?
On OSX, at commit 02c1ecad07f0e2d2febe8196271bcc64bdc9c006, running:
./server -m /var/tmp/models/bartowski/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf --mlock --host 0.0.0.0 --port 51039 -ngl 999 --chat-template llama3
I get total incoherence around token ~512. A clipped sample of the output:
In one run I let the incoherence go on for quite a long time, and after what may have been another ~512 tokens (just eyeballing it), it suddenly resolved back into coherence with some hallucinated lyrics.
On the other hand, if I start it with
./server -m /var/tmp/models/bartowski/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf --ctx-size 0 --mlock -ngl 999 --chat-template llama3 --port 50051
then all is well.
During startup, the version without --ctx-size 0 will print:

(startup output not reproduced here)

The version with --ctx-size 0 will print:

(startup output not reproduced here)

Additionally, I believe this message in the logs only appears in the version lacking the param:
{"tid":"0x202cdfac0","timestamp":1716964019,"level":"INFO","function":"update_slots","line":1851,"msg":"slot context shift","id_slot":0,"id_task":0,"n_keep":0,"n_left":511,"n_discard":255,"n_ctx":512,"n_past":511,"n_system_tokens":0,"n_cache_tokens":511}
and I believe that exactly corresponds with the incoherence: with n_keep=0 and n_past=511, n_left is 511 and n_discard is 255, i.e. the server discarded the first ~255 tokens of the conversation, which includes the start of the chat-templated prompt.
Name and Version
(venv) bash-3.2$ ./main --version
version: 3028 (02c1ecad)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
(venv) bash-3.2$
What operating system are you seeing the problem on?
Mac
Relevant log output
No response