ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Default ctx size on llama3 causes incoherence in server once output passes 512 tokens #7609

Open m9e opened 1 month ago

m9e commented 1 month ago

What happened?

On macOS, at commit 02c1ecad07f0e2d2febe8196271bcc64bdc9c006, running:

./server -m /var/tmp/models/bartowski/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf --mlock --host 0.0.0.0 --port 51039 -ngl 999 --chat-template llama3

I get total incoherence around token ~512. A clipped output:

User: name 50 ai pioneers, 1 per line, with 10-12 words on why each belongs on the list

Llama: Here is the list of 50 AI pioneers, one per line, with a brief description of why each belongs on the list:

1. Alan Turing - Founded computer science and proposed the Turing Test.
2. Marvin Minsky - Developed theory of artificial neural networks (ANNs).
[snip]
25. Eric Horvitz - Advanced decision theory and uncertainty in AI.
26. David Andrew Peter029.42128
3120274.

623)5]
.

3632922)
.

In one run I let the incoherence go on for quite a long time, and after what may have been another ~512 tokens (just eyeballing it), the output suddenly resolved back into coherence with some hallucinated lyrics.

On the other hand, if I start it with

./server -m /var/tmp/models/bartowski/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct-Q5_K_M.gguf --ctx-size 0 --mlock -ngl 999 --chat-template llama3 --port 50051

then all is well.

During startup, the version without --ctx-size 0 will print:

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512

The version with --ctx-size 0 will print:

llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512

Additionally, I believe this message in the logs only appears in the version lacking the parameter:

{"tid":"0x202cdfac0","timestamp":1716964019,"level":"INFO","function":"update_slots","line":1851,"msg":"slot context shift","id_slot":0,"id_task":0,"n_keep":0,"n_left":511,"n_discard":255,"n_ctx":512,"n_past":511,"n_system_tokens":0,"n_cache_tokens":511}

I believe that corresponds exactly with the onset of the incoherence.
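For my own understanding, here is how I read those fields (just a sketch of the arithmetic the log line implies, not the actual update_slots code):

#include <cstdio>

int main() {
    // Numbers taken from the log line above.
    const int n_ctx  = 512;  // default context in the first invocation
    const int n_keep = 0;    // nothing is pinned at the front of the cache
    const int n_past = 511;  // the cache is effectively full

    const int n_left    = n_past - n_keep;  // 511, matches "n_left" in the log
    const int n_discard = n_left / 2;       // 255, matches "n_discard" in the log

    // Tokens in [n_keep, n_keep + n_discard) are dropped, i.e. positions 0..254,
    // which is where the llama3 chat-template header and the question live.
    std::printf("n_ctx=%d n_left=%d n_discard=%d\n", n_ctx, n_left, n_discard);
    return 0;
}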

Name and Version

(venv) bash-3.2$ ./main --version
version: 3028 (02c1ecad)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
(venv) bash-3.2$

What operating system are you seeing the problem on?

Mac

Relevant log output

No response

ggerganov commented 1 month ago

This is normal behaviour - with a small context size (such as 512), the server will automatically discard past tokens when the context becomes full. With instruction-tuned models such as the one you are using, this can become catastrophic because the chat template likely gets destroyed and the model goes out of distribution (OOD). Using --ctx-size 0 will give you optimal behaviour, utilizing the maximum context for the model in use.
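Roughly speaking, --ctx-size 0 means "use the context length the model was trained with" (8192 for Llama 3). A simplified sketch of the idea, not the actual llama_new_context_with_model code:

#include <cstdint>
#include <cstdio>

// A requested context size of 0 falls back to the model's training context,
// so no context shift happens until the full window is exhausted.
static uint32_t resolve_n_ctx(uint32_t n_ctx_requested, uint32_t n_ctx_train) {
    return n_ctx_requested == 0 ? n_ctx_train : n_ctx_requested;
}

int main() {
    std::printf("%u\n", resolve_n_ctx(0,   8192)); // --ctx-size 0  -> 8192
    std::printf("%u\n", resolve_n_ctx(512, 8192)); // default 512   -> shifts kick in early
    return 0;
}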

m9e commented 1 month ago

I understand the point there. Would there be downsides to making --ctx-size 0 the default when loading an instruction-tuned model? Or, on any given generation, to set -n to (max_ctx - input_ctx) (e.g., "auto max new tokens" behavior; see the sketch at the end of this comment)? Or have a flag for that?

Or:

Just a bunch of thoughts. Obviously I was being ignorant here (in my mind I was conflating app-level context flushing via truncation with model-level context shifting, which is obviously much more painful for the output!), but it feels like the default behavior here is a bit of a trap that could be avoided. I may just not be seeing the downsides, though.
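A rough sketch of the "auto max new tokens" idea; this is purely hypothetical (the function name and behaviour are made up, not an existing option):

#include <algorithm>
#include <cstdio>

// Clamp the requested number of new tokens so prompt + output never exceed
// the context window, instead of shifting the KV cache mid-generation.
static int auto_n_predict(int n_ctx, int n_prompt_tokens, int n_requested) {
    const int n_room = std::max(0, n_ctx - n_prompt_tokens);
    // A negative request ("unlimited") becomes "fill the remaining room".
    return n_requested < 0 ? n_room : std::min(n_requested, n_room);
}

int main() {
    // With the default 512-token context and a ~60-token prompt, generation
    // would stop after ~452 new tokens instead of shifting into incoherence.
    std::printf("%d\n", auto_n_predict(512, 60, -1));
    return 0;
}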

ggerganov commented 1 month ago

Yes, it can be improved. I will try to address similar issues in #7675.

dspasyuk commented 3 weeks ago

This issue seems to be related to mine: https://github.com/ggerganov/llama.cpp/issues/7929#issue-2352272658. With today's version the garbage-output problem seems gone, and everything works as in the B3080 version except for the context window. Previously, when the output reached the context window size, it would just reset and continue answering questions indefinitely; now, once the context window is filled with output from multiple questions, generation simply stops. Is there a way to automatically free the context window after it gets filled?

Here is how I run it:

llama.cpp/llama-cli --model ../../models/meta-llama-3-8b-instruct_q5_k_s.gguf --n-gpu-layers 35 -cnv --interactive-first --simple-io --interactive -b 2048 --ctx_size 4096 --temp 0.3 --top_k 10 --multiline-input --repeat_penalty 1.12 -t 6 --chat-template llama3