ggerganov / llama.cpp

LLM inference in C/C++

Bug: Phi-3 mini 128k performance degradation with kv size > 8k (server) #8995

Closed: steampunque closed this issue 3 weeks ago

steampunque commented 1 month ago

What happened?

I ran some benches on Phi-3 mini 128k and noticed a large performance drop on LAMBADA, from 0.618 to 0.496 accuracy. I traced the problem to increasing the size of the KV cache above 8k with the server (at any value above 8k the accuracy drops to 0.494). Performance on other benches is also degraded when the KV cache size is above 8k.

| NKV | LAMBADA ACC |
| --- | --- |
| 4k | 0.620 |
| 8k | 0.620 |
| 10k | 0.494 |

Name and Version

b3565

What operating system are you seeing the problem on?

Linux

Relevant log output

The issue can be exposed using the CLI with a single LAMBADA prompt, but for some strange reason the failure threshold there seems to be 4k instead of 8k as on the server (i.e. the CLI fails at any KV size > 4k on a LAMBADA test prompt that should work, while the server fails above 8k with the same prompt).

LAMBADA test prompt, nkv=10240 (10k) or nkv=4096 (4k) (the -c parameter):

llama-cli -m /data3hd/models/Phi-3-mini-128k-instruct.Q6_K.gguf --color -n -1 --log-disable -ngl 33 -c 10240 -ctk f16 -ctv f16 -b 128 -n 10 --keep 0 --temp 0.0 --dynatemp-range 0.0 --dynatemp-exp 1.0 --top-k 40 --top-p 0.95 --typical 1.0 --min-p 0.00 --repeat-last-n 64 --repeat-penalty 1.0 --presence-penalty 0.0 --frequency-penalty 0.0 --tfs 1.0 --mirostat 0 --mirostat-lr 0.1 --mirostat-ent 5.0 -p "in my palm is a clear stone , and inside it is a small ivory statuette . a guardian angel . `` figured if you 're going to be out at night getting hit by cars , you might as well have some backup . '' i look at him , feeling stunned . like this is some sort of sign . but as i stare at harlin , his mouth curved in a confident grin , i do n't care about"

nkv=10240, incorrect answer:

-->  the angel . i ' m going to kill

nkv=4096, correct answer:

--> signs . i 'm going to kill him

The same behavior shows with -ngl 0 (full CPU on a 9900K) or -ngl 33 (full offload to a 4070).

steampunque commented 1 month ago

single line test prompt:

llama-cli -m /data3hd/models/Phi-3-mini-128k-instruct.Q6_K.gguf --color -n -1 --log-disable -ngl 0 -c 4096 -ctk f16 -ctv f16 -b 128 -n 10 --keep 0 --temp 0.0 --dynatemp-range 0.0 --dynatemp-exp 1.0 --top-k 40 --top-p 0.95 --typical 1.0 --min-p 0.00 --repeat-last-n 64 --repeat-penalty 1.0 --presence-penalty 0.0 --frequency-penalty 0.0 --tfs 1.0 --mirostat 0 --mirostat-lr 0.1 --mirostat-ent 5.0 -p "in my palm is a clear stone , and inside it is a small ivory statuette . a guardian angel . `` figured if you 're going to be out at night getting hit by cars , you might as well have some backup . '' i look at him , feeling stunned . like this is some sort of sign . but as i stare at harlin , his mouth curved in a confident grin , i do n't care about"

JohannesGaessler commented 1 month ago

Do you experience similar issues using other models?

steampunque commented 1 month ago

> Do you experience similar issues using other models?

Llama 3.1 (also 128k context):

3634/5153 correct in all cases:

| NKV | LAMBADA ACC |
| --- | --- |
| 4096 | 0.705 |
| 8192 | 0.705 |
| 10240 | 0.705 |
| 16384 | 0.705 |

llama-cli -m /data3hd/models/Meta-Llama-3.1-8B-Instruct.Q6_K.gguf  --color -n -1  --log-disable    -ngl 33 -c 4096 -ctk f16 -ctv f16 -b 128  -n 10 --keep 0    --temp 0.0 --dynatemp-range 0.0 --dynatemp-exp 1.0    --top-k 40 --top-p 0.95 --typical 1.0 --min-p 0.00    --repeat-last-n 64 --repeat-penalty 1.0    --presence-penalty 0.0 --frequency-penalty 0.0    --tfs 1.0    --mirostat 0 --mirostat-lr 0.1 --mirostat-ent 5.0    -p "in my palm is a clear stone , and inside it is a small ivory statuette . a guardian angel . `` figured if you 're going to be out at night getting hit by cars , you might as well have some backup . '' i look at him , feeling stunned . like this is some sort of sign . but as i stare at harlin , his mouth curved in a confident grin , i do n't care about"

nkv=10240, semantically correct answer (though I grade it a mismatch; I need to update the grading logic in my bench, but that is a non-issue here):

the sign . i care about the fact that he

nkv=4096, same answer

the sign . i care about the fact that he

The Phi-3 test uses fresh conversions made after the recent sliding-window patch that broke all the Phi-3 models.

steampunque commented 4 weeks ago

I decided to reopen this since the problem still shows up with Phi 3.5 mini, and someone else also posted a complaint about Phi mini, so there is almost certainly a bug somewhere in the inference platform when running this model. The symptoms show a KV cache > 8192 as the degradation trip point. Both Phi 3 mini 128k and Phi 3.5 mini can be used with a KV cache size up to 8192 with no performance degradation (my current workaround for the problem), but above 8192 performance takes a sharp dive.

Phi 3.5 mini, b3609

| NKV | LAMBADA ACC |
| --- | --- |
| 4k | 0.673 |
| 8k (8192) | 0.673 |
| 8194 | 0.613 |
| 19k | 0.613 |

steampunque commented 3 weeks ago

Well, Phi-3 medium does the same thing. GLM-4, InternLM, Llama 3.1, all the SOTA high-context models, are OK. The difference, I believe, is LongRoPE, which only Phi-3 has. From M$:

What does LongRoPE do?

The LongRoPE algorithm is built upon the two forms of non-uniformities in positional interpolation: varying RoPE dimensions and token positions. In order to achieve the best performance on long context windows using non-uniform positional embeddings, LongRoPE:

- Exploit the best positional embedding rescaling parameters through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios;
- Introduce a progressive extension strategy that first fine-tunes a 256k length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window;
- Readjust scaling factors and retained start tokens on 8k length to recover the short context window performance.

Notice the 8k in the blurb above. I don't think it's a coincidence that this is where performance gets munged, though I don't understand why just having NKV bigger than 8k would trigger it. All the LAMBADA prompts are in the range of ~100 tokens, and only 3 or 4 output tokens are generated in the test to produce the next predicted word.
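
To spell out what those "rescaling parameters" do mechanically, here is a tiny self-contained sketch (not llama.cpp code; the head dimension and factor values are made up for illustration). It assumes the per-dimension factors divide the rotary angle, which is how I understand ggml's freq_factors to be applied: a "long" factor set compresses the angles so far-away positions stay in range, at the cost of short-context fidelity.

```cpp
// Minimal sketch: how per-dimension scaling factors modify rotary angles.
// Assumes the factor divides the per-dimension angle (factor = 1.0 -> plain RoPE).
#include <cmath>
#include <cstdio>

float rope_angle(int pos, int i, int n_dims, float freq_base, float factor) {
    const float inv_freq = std::pow(freq_base, -2.0f * (float) i / (float) n_dims);
    return (float) pos * inv_freq / factor;
}

int main() {
    const int   n_dims    = 96;       // head dimension, illustrative only
    const float freq_base = 10000.0f;

    // Hypothetical factor sets: "short" stays near 1.0 and keeps the original
    // behavior, "long" > 1.0 shrinks the angles to reach further positions.
    const float short_factor = 1.0f;
    const float long_factor  = 4.0f;

    const int positions[] = { 100, 8192, 100000 };
    for (int pos : positions) {
        std::printf("pos %6d, dim 0: short-angle %.1f  long-angle %.1f\n",
                pos,
                rope_angle(pos, 0, n_dims, freq_base, short_factor),
                rope_angle(pos, 0, n_dims, freq_base, long_factor));
    }
    return 0;
}
```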

ggerganov commented 3 weeks ago

Probably related to the logic for choosing long/short rope factors:

https://github.com/ggerganov/llama.cpp/blob/80d9d2a5514ee6faa85b372b75e16d5edfecd437/src/llama.cpp#L9393-L9406

There might be some issue there; need to compare with the reference code.

steampunque commented 3 weeks ago

> Probably related to the logic for choosing long/short rope factors:

Aha. That is definitely what is going on here. It's understandable that performance will take a hit with very long prompts, which rely on RoPE scaling to work, but for prompts that fit inside the natural sequence length of the model (apparently targeting 8k here) it seems like performance should be unaffected. So it seems like some kind of logic is needed that dynamically selects the long/short freq factors based on the current number of tokens in the KV cache (not its max size). As written, this code will always penalize the model down to the performance of the LONG rope factors just by configuring a KV cache greater than 8192 (apparently the value of hparams.n_ctx_orig_yarn).
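
To make the trip point concrete, here is a small self-contained sketch of that selection as I read the linked lines. The structs are simplified stand-ins for the real hparams/cparams, so treat it as a paraphrase rather than actual llama.cpp code; the point is that the choice is driven purely by the configured per-sequence context size, so -c 10240 flips every prompt onto the long factors.

```cpp
// Paraphrase of the long/short rope-factor selection linked above; the
// structs below are simplified stand-ins, only the decision logic matters.
#include <cstdint>
#include <cstdio>

struct hparams_t { uint32_t n_ctx_orig_yarn; };            // model's "native" context for the short factors
struct cparams_t { uint32_t n_ctx; uint32_t n_seq_max; };  // configured KV size and parallel sequences

enum rope_set { ROPE_SHORT, ROPE_LONG };

// The decision uses the *configured* per-sequence context, not how many
// tokens the current prompt actually puts into the KV cache.
rope_set select_rope_factors(const hparams_t & hp, const cparams_t & cp) {
    const uint32_t n_ctx_per_seq = cp.n_ctx / cp.n_seq_max;
    return n_ctx_per_seq > hp.n_ctx_orig_yarn ? ROPE_LONG : ROPE_SHORT;
}

int main() {
    const hparams_t hp = { 8192 };  // assumed n_ctx_orig_yarn for Phi-3 mini 128k

    const uint32_t ctxs[] = { 4096, 8192, 10240 };
    for (uint32_t n_ctx : ctxs) {
        const cparams_t cp = { n_ctx, 1 };
        std::printf("-c %5u -> %s factors\n", n_ctx,
                select_rope_factors(hp, cp) == ROPE_LONG ? "long" : "short");
    }
    return 0;
}
```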

According to M$ all these original models will be affected:

[Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)
[Phi-3-small-128k-instruct](https://huggingface.co/microsoft/Phi-3-small-128k-instruct)
[Phi-3-medium-128k-instruct](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct)
[Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)

The new Phi-3.5 mini, vision, and MoE models all use LongRoPE and will also be affected. The "original_max_position_embeddings" is 4096, so that might also explain some deviation in performance at the 4096 boundary, which strangely only surfaced when I was using the CLI, but the long/short rope freq selection seems to be the big hitter.

steampunque commented 3 weeks ago

Well, I dug into the code and don't see a quick fix for this. On the surface, a simple hack could be used:

    // if (n_ctx_pre_seq > hparams.n_ctx_orig_yarn) {
    if (llama_get_kv_cache_used_cells(&lctx) > hparams.n_ctx_orig_yarn) {
        return model.layers[il].rope_long;
    }

but this gives the aggregate KV use, not the use in the batch to be decoded, and it also does not account for the tokens that will be added to the KV cache during the decode. The more top-down way to handle it would be to predict the final number of KV tokens in llama_decode and filter that info down to llama_build_graph and then to the llama_build_phi3 routine as a parameter somehow. However, the batching logic throws a monkey wrench into the works, since there doesn't seem to be a way to configure things uniquely on a per-seq-id basis inside the batch, i.e. llama_build_phi3 is going to cover all seq ids in the batch no matter how big each individual one is. Hence, unless running a single slot, it seems impossible to adapt the config to the final number of KV tokens for a slot after decode, and it would be a complete mess to have performance potentially varying as a function of how many other unrelated slots are running.
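
For the record, the kind of dynamic selection I mean would look roughly like the sketch below. It is purely hypothetical, all names are invented for illustration, and the batching problem above still applies because one graph covers every sequence in the batch.

```cpp
// Hypothetical sketch only: pick rope factors from the positions a decode
// will actually reach, instead of from the configured KV size. The names
// are invented for illustration; this is not a llama.cpp patch.
#include <algorithm>
#include <cstdint>
#include <vector>

enum rope_set { ROPE_SHORT, ROPE_LONG };

// pos[i] is the KV position of token i in the batch about to be decoded.
rope_set select_rope_factors_dynamic(const std::vector<uint32_t> & pos,
                                     uint32_t n_ctx_orig) {
    const uint32_t max_pos = pos.empty() ? 0 : *std::max_element(pos.begin(), pos.end());
    // +1 because positions are 0-based: decoding at max_pos means the
    // sequence will hold max_pos + 1 tokens afterwards.
    return (max_pos + 1) > n_ctx_orig ? ROPE_LONG : ROPE_SHORT;
}

// Caveat from the discussion above: the graph is built once per batch and
// covers every seq id in it, so a short prompt batched together with a long
// one would still get pushed onto the long factors.
```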

Unless anyone else has ideas, I think the only resolution to this problem is to document somewhere that for Phi-3 models using LongRoPE, if the KV cache is sized > 8192 the model will be using long rope scaling all the time and performance will be degraded. I am guessing the other models with fixed rope scaling are essentially running this way all the time (the long-context degradation is baked in for all prompts, independent of their length). So the Phi-3 design potentially has at least the advantage of being able to get short-context performance if the user knows long prompts won't be needed and can therefore configure KV <= 8192.