Bug: llama.cpp does not use XTC sampler when given temperature == 0 even if temperature is not in sampler sequence #9904

Closed: justinsteven closed this issue 1 hour ago

justinsteven commented 19 hours ago

What happened?

It's possible I'm misunderstanding samplers and sampler parameters.

It's also possible this is a symptom of a larger problem, where "default" values for some samplers may cause other samplers not to be activated in llama-server.

Observed behaviour

There are several sampler parameters which, when given "do nothing" default values via llama-server's /completions API, seem to cause the XTC sampler to not be used.

Update: The above should be "Given a temperature of 0, even if temperature is not in the requested sampler sequence, the XTC sampler is not used"

The following JSON payload demonstrates the issue:

{
  "prompt": "<|im_start|>system\nYou are a creative story writer<|im_end|>\n<|im_start>user\nWrite a story about a wizard who is losing his ability to do magic, and tries everything to get it back.<|im_end|>\n<|im_start|>assistant\n",
  "n_predict": 512,
  "seed": 1,
  "xtc_probability": 0.5,
  "xtc_threshold": 0.1,
  "samplers": [
    "xtc"
  ],
  "top_k": 0,
  "tfs_z": 1,
  "top_p": 1,
  "min_p": 0,
  "temperature": 0
}

In my testing, this causes the XTC sampler to not be activated. The vibe was off, and the hacky debug print that I added below was never hit:

diff --git a/src/llama-sampling.cpp b/src/llama-sampling.cpp
index 2e655068..63e0d043 100644
--- a/src/llama-sampling.cpp
+++ b/src/llama-sampling.cpp
@@ -1084,6 +1084,7 @@ static void llama_sample_xtc_apply(struct llama_sampler * smpl, llama_token_data
         || cur_p->size < 2) {
         return;
     }
+    puts("ok");

     std::uniform_real_distribution<float> distribution(0.0f, 1.0f);
     float chance = distribution(ctx->rng);

Given the following simpler JSON payload, the hacky debug print was hit:

{
  "prompt": "<|im_start|>system\nYou are a creative story writer<|im_end|>\n<|im_start>user\nWrite a story about a wizard who is losing his ability to do magic, and tries everything to get it back.<|im_end|>\n<|im_start|>assistant\n",
  "n_predict": 512,
  "seed": 1,
  "xtc_probability": 0.5,
  "xtc_threshold": 0.1,
  "samplers": [
    "xtc"
  ]
}

Furthermore, each of the things after my samplers array seems to individually cause XTC to not activate. For example, a temperature of 0 (without specifying any of top_k, tfs_z, top_p or min_p) is enough to cause XTC to not activate.

There may be other parameters, including sampler parameters, which cause XTC to not activate, but which I did not test.

(Update: I was wrong about this, it seems as though only temperature == 0 reproduces the issue)

This is problematic for clients such as SillyTavern, which seem to always send all samplers in the array but rely on sending default parameters (e.g. 0 in the case of temperature) to effectively disable them. Such a client will never be able to activate XTC if the user sets temperature to 0 in the hope of disabling the temperature sampler.

Expected behaviour

If XTC is in the samplers array, and xtc_threshold and xtc_probability meet the criteria for XTC to be used, XTC should be used regardless of parameters for other samplers.

More generally, if any sampler is in the samplers array, and its parameters meet the criteria for it to be used, it should be used regardless of parameters for other samplers (?)

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 3923 (becfd387)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

MaggotHATE commented 16 hours ago

temperature of 0

AFAIK, such a value activates greedy sampling, which doesn't use the sampling queue. You can look in common/sampling.cpp.

It should work with temperature > 0, and I don't think a temperature of 0 is normally the default.

justinsteven commented 15 hours ago

Oops. I said:

Furthermore, each of the things after my samplers array seems to individually cause XTC to not activate. For example, a temperature of 0 (without specifying any of top_k, tfs_z, top_p or min_p) is enough to cause XTC to not activate.

This is not true. I don't know how I managed to goof that testing. It seems to be only a temperature of 0 that causes XTC to not be applied.

Also, this is reproducible using llama-cli. I still have the puts("ok") hacked in where XTC does its work.

Without specifying a temperature:

$ /llama.cpp/llama-cli -t 8 -ngl 99 -m /models/anthracite-org/magnum-v3-34b-gguf/magnum-v3-34b-IQ4_XS.gguf -c 4096 --flash-attn -p "You are a creative story writer" --sampling-seq mx --min-p 0.02 --xtc-probability 0.5 --seed 1
[... SNIP ...]

sampler seed: 1
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.020, xtc_probability = 0.500, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> min-p -> xtc -> softmax -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

You are a creative story writerok
 whook
 shouldok
 beok
 gettingok
 publishedok
!ok
"ok

ok
Sook
 muchok
 forok
 herok
 trying took
 makeok
 me feelok
 badok
,ok
 Iok
 feltok
 GREATok
!ok
 Thanksok
 forok
 theok
 lovelyok
 boostok
 ofok
 confidence
llama_perf_sampler_print:    sampling time =       2.27 ms /    36 runs   (    0.06 ms per token, 15824.18 tokens per second)
llama_perf_context_print:        load time =    1775.83 ms
llama_perf_context_print: prompt eval time =      54.10 ms /     6 tokens (    9.02 ms per token,   110.90 tokens per second)
llama_perf_context_print:        eval time =     917.76 ms /    29 runs   (   31.65 ms per token,    31.60 tokens per second)
llama_perf_context_print:       total time =     984.03 ms /    35 tokens
Interrupted by user

(The output containing lots of ok\n is expected due to my gross debugging technique)

With a non-zero temperature specified:

$ /llama.cpp/llama-cli -t 8 -ngl 99 -m /models/anthracite-org/magnum-v3-34b-gguf/magnum-v3-34b-IQ4_XS.gguf -c 4096 --flash-attn -p "You are a creative story writer" --sampling-seq mx --min-p 0.02 --xtc-probability 0.5 --seed 1 --temp 1.0
[... SNIP ...]

sampler seed: 1
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.020, xtc_probability = 0.500, xtc_threshold = 0.100, typical_p = 1.000, temp = 1.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> min-p -> xtc -> softmax -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

You are a creative story writerok
 whook
 shouldok
 beok
 gettingok
 publishedok
!ok
"ok

ok
Sook
 muchok
 forok
 herok
 trying took
 makeok
 me feelok
 badok
,ok
 Iok
 feltok
 GREATok
!ok
 Thanksok
 forok
 theok
 lovelyok
 boostok
 ofok
 confidenceok
.ok
..ok
andok
 aok
 reminderok
 ofok
 howok
 Iok
 spendok
 myok
 free timeok
!ok
 Iok
 amok
 workingok
 onok
 getting
llama_perf_sampler_print:    sampling time =       3.66 ms /    55 runs   (    0.07 ms per token, 15027.32 tokens per second)
llama_perf_context_print:        load time =    1757.99 ms
llama_perf_context_print: prompt eval time =      53.56 ms /     6 tokens (    8.93 ms per token,   112.03 tokens per second)
llama_perf_context_print:        eval time =    1505.34 ms /    48 runs   (   31.36 ms per token,    31.89 tokens per second)
llama_perf_context_print:       total time =    1576.80 ms /    54 tokens
Interrupted by user

XTC is active in both cases, and in both the reported sampler chain is logits -> logit-bias -> penalties -> min-p -> xtc -> softmax -> dist

With a temperature of 0:

$ /llama.cpp/llama-cli -t 8 -ngl 99 -m /models/anthracite-org/magnum-v3-34b-gguf/magnum-v3-34b-IQ4_XS.gguf -c 4096 --flash-attn -p "You are a creative story writer" --sampling-seq mx --min-p 0.02 --xtc-probability 0.5 --seed 1 --temp 0
[... SNIP ...]

sampler seed: 4294967295
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.020, xtc_probability = 0.500, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

You are a creative story writer. I enjoyed reading your story.
I am glad you enjoyed the story. I have been writing stories for a long time. I have a lot of stories to share.
I am glad you enjoyed the story. I have been writing stories for a long time. I have a lot of stories to share. I am glad you enjoyed the story. I have been writing stories for a long time. I have a lot of
llama_perf_sampler_print:    sampling time =       1.55 ms /    92 runs   (    0.02 ms per token, 59469.94 tokens per second)
llama_perf_context_print:        load time =    1786.14 ms
llama_perf_context_print: prompt eval time =      54.71 ms /     6 tokens (    9.12 ms per token,   109.68 tokens per second)
llama_perf_context_print:        eval time =    2680.06 ms /    85 runs   (   31.53 ms per token,    31.72 tokens per second)
llama_perf_context_print:       total time =    2769.77 ms /    91 tokens
Interrupted by user

XTC was not activated. Indeed, the reported sampler chain doesn't include XTC and ends with greedy, so it's probably the else branch of:

https://github.com/ggerganov/llama.cpp/blob/becfd387f6919d99ec34b76c2522f90ac250c489/common/sampling.cpp#L174-L228
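
For reference, a heavily simplified sketch of what I understand that branch to do (approximate, not the exact code in common/sampling.cpp):

// Approximate sketch only: when temp <= 0 the requested sampler sequence is
// skipped entirely and a greedy sampler is added instead, which is why XTC
// never runs.
if (params.temp > 0.0f) {
    // add the samplers from the requested sequence (min-p, xtc, ...),
    // then finish the chain with the dist (random sampling) stage
} else {
    // temp == 0: deterministic output, always pick the most likely token
    llama_sampler_chain_add(chain, llama_sampler_init_greedy());
}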

Good catch @MaggotHATE, and thank you for your work on XTC, I've been following it keenly :)

However, in this specific llama-cli case, the --sampling-seq was mx and so I'd argue that the temperature == 0 greedy shortcut probably shouldn't be taken.

@MaggotHATE said:

It should work with temperature > 0, and I don't think that temperature 0 is default normally.

Indeed it does work with temperature > 0

However, I'd say that someone who specifies a sampler chain excluding temperature should never send llama down the temperature == 0 greedy path (see above).

Furthermore, some clients may present a UI to the user that doesn't allow them to disable temperature. For example, SillyTavern always sends:

samplers: [
  'top_k',
  'tfs_z',
  'typical_p',
  'top_p',
  'min_p',
  'temperature'
],

The order is configurable by the user, but it seems all samplers are always specified. The user is directed to set sampler parameters to "do nothing" values (e.g. temperature = 0) if they wish to effectively disable a sampler. And so if SillyTavern adds xtc to the sampler chain for llama.cpp consumption, XTC (and other samplers?) will silently break if the user sets temperature to 0 to try to disable it.

I tried to do this because it was my impression that the recommended settings for XTC are only min_p and xtc and that temperature isn't recommended - but maybe I'm mistaken there.

And so, is this working as designed? Disabling samplers such as XTC when temperature == 0 was surprising to me, but now that I know about this greedy sampling behaviour I'm guessing it's an optimisation. In light of new samplers such as XTC, is this optimisation potentially harmful or confusing? Or is this WAD, and the API of llama.cpp is such that if consuming software wants temperature to not be involved in the sampler chain but wants XTC (or other samplers?), it must:

slaren commented 15 hours ago

It is not an optimization; using a temperature of 0 is intended to enable greedy sampling. Greedy sampling implies that all other samplers are disabled, so that's what it does. This is documented in https://github.com/ggerganov/llama.cpp/tree/master/examples/main#temperature, but admittedly this is not very intuitive behavior and it is not easy to find in the documentation. Maybe we should just add a parameter to explicitly enable greedy sampling instead?

justinsteven commented 15 hours ago

Hmm. I definitely don't understand samplers then, thanks for your patience :sweat_smile:

Is there a functional difference, or difference in a user's intention, between:

Is "greedy" synonymous with "no sampling, just take the most likely next token always"?

I assumed that putting temperature in the sampler chain and setting it to 0 makes the temperature sampler a pass-through no-op. I didn't expect setting temperature to 0 to hijack the chain entirely (whether or not temperature is in the specified chain) and enable greedy behaviour, but perhaps there is a need for a consumer to signal that they want this greedy behaviour. The way it's done right now surprises me as a user.

slaren commented 15 hours ago

Is "greedy" synonymous with "no sampling, just take the most likely next token always"?

Yes, that's the way the term is used here. The logic is that lower temperatures make the sampling more deterministic, so a temperature of 0 should mean that the sampling is completely deterministic. To disable temperature you would have to set it to a value of 1. This was reasonable when the entire sampling chain consisted of top-k, top-p and temperature, but as the samplers got more complex I can understand that this can be confusing.
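
For intuition, temperature divides the logits before the softmax, so a temperature of 1 leaves the model's original probabilities untouched, while values approaching 0 push all of the probability onto the most likely token. A minimal standalone illustration (not llama.cpp's actual code):

#include <algorithm>
#include <cmath>
#include <vector>

// p_i is proportional to exp(logit_i / T). With T = 1 the probabilities are
// exactly the model's original ones; as T -> 0 the distribution collapses
// onto the most likely token, which is why T = 0 is special-cased as greedy.
std::vector<float> softmax_with_temperature(std::vector<float> logits, float temp) {
    // temp must be > 0 here (and logits non-empty); temp == 0 would divide
    // by zero, hence the separate greedy shortcut.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float & l : logits) {
        l = std::exp((l - max_logit) / temp);  // scale by 1/T before exponentiating
        sum += l;
    }
    for (float & l : logits) {
        l /= sum;                              // normalize into probabilities
    }
    return logits;
}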

MaggotHATE commented 15 hours ago

The problem here is that temperature = 0 is the intended and established way of requesting deterministic (greedy) sampling, because in some tasks no variation is needed, and that's exactly what zero temperature represents. It is still used in practice (in fact, it appears in some recent papers), so I'm not even sure how to define a different convention that would not break existing rules.

justinsteven commented 15 hours ago

To disable temperature you would have to set it to a value of 1.

Oh. So it's a temperature of 1 that is a no-op? Is it equivalent to not putting temperature in the sampler chain?

I assumed that 0 would be the no-op value.

I see now that the SillyTavern UI indeed says to set temperature to 1 "for the original probabilities" :facepalm:

Perhaps there's still merit in reconsidering the temperature == 0 greedy magic switch, but this seems to have been user error on my part.