Closed. justinsteven closed this issue 1 hour ago.
temperature of 0

AFAIK, such a value activates greedy sampling, which doesn't use the sampling queue. You can look in common/sampling.cpp. It should work with temperature > 0, and I don't think that a temperature of 0 is normally the default.
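To illustrate the behaviour being described, here is a minimal Python sketch (an illustration only, not llama.cpp's actual code; see common/sampling.cpp for the real implementation): a temperature of 0 or below takes a greedy shortcut that bypasses the rest of the sampler chain entirely.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sketch of a sampler with a greedy shortcut at temperature <= 0."""
    if temperature <= 0:
        # Greedy shortcut: always pick the most likely token. No other
        # samplers (XTC, min-p, ...) ever run on this path.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise: temperature-scaled softmax weights, then a random draw.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights)[0]
```

With temperature 0 the result is fully deterministic (always the argmax), which is why nothing downstream of the shortcut gets a chance to act.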
Oops. I said:
Furthermore, each of the things after my samplers array seem to individually cause XTC to not activate. For example, a temperature of 0 (without specifying any of top_k, tfs_z, top_p or min_p) is enough to cause XTC to not activate.
This is not true, and I don't know how I managed to goof that testing. It seems that only a temperature of 0 causes XTC to not be done.
Also, this is reproducible using llama-cli. I still have the puts("ok") hacked in when XTC does its work.
Without specifying a temperature:
$ /llama.cpp/llama-cli -t 8 -ngl 99 -m /models/anthracite-org/magnum-v3-34b-gguf/magnum-v3-34b-IQ4_XS.gguf -c 4096 --flash-attn -p "You are a creative story writer" --sampling-seq mx --min-p 0.02 --xtc-probability 0.5 --seed 1
[... SNIP ...]
sampler seed: 1
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.020, xtc_probability = 0.500, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> min-p -> xtc -> softmax -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
You are a creative story writerok
whook
shouldok
beok
gettingok
publishedok
!ok
"ok
ok
Sook
muchok
forok
herok
trying took
makeok
me feelok
badok
,ok
Iok
feltok
GREATok
!ok
Thanksok
forok
theok
lovelyok
boostok
ofok
confidence
llama_perf_sampler_print: sampling time = 2.27 ms / 36 runs ( 0.06 ms per token, 15824.18 tokens per second)
llama_perf_context_print: load time = 1775.83 ms
llama_perf_context_print: prompt eval time = 54.10 ms / 6 tokens ( 9.02 ms per token, 110.90 tokens per second)
llama_perf_context_print: eval time = 917.76 ms / 29 runs ( 31.65 ms per token, 31.60 tokens per second)
llama_perf_context_print: total time = 984.03 ms / 35 tokens
Interrupted by user
(The output containing lots of "ok\n" is expected due to my gross debugging technique.)
With a non-zero temperature specified:
$ /llama.cpp/llama-cli -t 8 -ngl 99 -m /models/anthracite-org/magnum-v3-34b-gguf/magnum-v3-34b-IQ4_XS.gguf -c 4096 --flash-attn -p "You are a creative story writer" --sampling-seq mx --min-p 0.02 --xtc-probability 0.5 --seed 1 --temp 1.0
[... SNIP ...]
sampler seed: 1
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.020, xtc_probability = 0.500, xtc_threshold = 0.100, typical_p = 1.000, temp = 1.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> min-p -> xtc -> softmax -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
You are a creative story writerok
whook
shouldok
beok
gettingok
publishedok
!ok
"ok
ok
Sook
muchok
forok
herok
trying took
makeok
me feelok
badok
,ok
Iok
feltok
GREATok
!ok
Thanksok
forok
theok
lovelyok
boostok
ofok
confidenceok
.ok
..ok
andok
aok
reminderok
ofok
howok
Iok
spendok
myok
free timeok
!ok
Iok
amok
workingok
onok
getting
llama_perf_sampler_print: sampling time = 3.66 ms / 55 runs ( 0.07 ms per token, 15027.32 tokens per second)
llama_perf_context_print: load time = 1757.99 ms
llama_perf_context_print: prompt eval time = 53.56 ms / 6 tokens ( 8.93 ms per token, 112.03 tokens per second)
llama_perf_context_print: eval time = 1505.34 ms / 48 runs ( 31.36 ms per token, 31.89 tokens per second)
llama_perf_context_print: total time = 1576.80 ms / 54 tokens
Interrupted by user
XTC is active in both cases. In both cases, the reported "sampler chain" is logits -> logit-bias -> penalties -> min-p -> xtc -> softmax -> dist
With a temperature of 0:
$ /llama.cpp/llama-cli -t 8 -ngl 99 -m /models/anthracite-org/magnum-v3-34b-gguf/magnum-v3-34b-IQ4_XS.gguf -c 4096 --flash-attn -p "You are a creative story writer" --sampling-seq mx --min-p 0.02 --xtc-probability 0.5 --seed 1 --temp 0
[... SNIP ...]
sampler seed: 4294967295
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.020, xtc_probability = 0.500, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
You are a creative story writer. I enjoyed reading your story.
I am glad you enjoyed the story. I have been writing stories for a long time. I have a lot of stories to share.
I am glad you enjoyed the story. I have been writing stories for a long time. I have a lot of stories to share. I am glad you enjoyed the story. I have been writing stories for a long time. I have a lot of
llama_perf_sampler_print: sampling time = 1.55 ms / 92 runs ( 0.02 ms per token, 59469.94 tokens per second)
llama_perf_context_print: load time = 1786.14 ms
llama_perf_context_print: prompt eval time = 54.71 ms / 6 tokens ( 9.12 ms per token, 109.68 tokens per second)
llama_perf_context_print: eval time = 2680.06 ms / 85 runs ( 31.53 ms per token, 31.72 tokens per second)
llama_perf_context_print: total time = 2769.77 ms / 91 tokens
Interrupted by user
XTC was not activated. And indeed, the sampler chain doesn't include XTC and ends with greedy, so it's probably indeed the else branch of:
Good catch @MaggotHATE, and thank you for your work on XTC, I've been following it keenly :)
However, in this specific llama-cli case, the --sampling-seq was mx, and so I'd argue that the temperature == 0 greedy shortcut probably shouldn't be taken.
@MaggotHATE said:
It should work with temperature > 0, and I don't think that temperature 0 is default normally.
Indeed it does work with temperature > 0. However, I'd say that someone who specifies a sampler chain excluding temperature should never send llama down the temperature == 0 greedy path (see above).
Furthermore, some clients may present a UI to the user that doesn't allow them to disable temperature. For example, SillyTavern always sends:
samplers: [
'top_k',
'tfs_z',
'typical_p',
'top_p',
'min_p',
'temperature'
],
The order is configurable by the user, but it seems all samplers are always specified. The user is directed to set sampler parameters to "do nothing" values (e.g. temperature = 0) if they wish to effectively disable a sampler. And so, if SillyTavern adds xtc to the sampler chain for llama.cpp consumption, XTC (and other samplers?) will silently break if the user sets temperature to 0 to try to disable it.
I tried to do this because it was my impression that the recommended settings for XTC are only min_p and xtc, and that temperature isn't recommended, but maybe I'm mistaken there.
And so: is this working as designed? Disabling samplers such as XTC when temperature == 0 was surprising to me, but now that I know about this greedy sampling behaviour I'm guessing it's an optimisation. In light of new samplers such as XTC, is this optimisation potentially harmful or confusing? Or is this WAD, and is the llama.cpp API such that if consuming software wants temperature to not be involved in the sampler chain but wants XTC (or other samplers?), it must:
It is not an optimization: using a temperature of 0 is intended to enable greedy sampling. Greedy sampling implies that all other samplers are disabled, so that's what it does. This is documented in https://github.com/ggerganov/llama.cpp/tree/master/examples/main#temperature, but admittedly this is not very intuitive behavior and it is not easy to find in the documentation. Maybe we should just add a parameter to explicitly enable greedy sampling instead?
Hmm. I definitely don't understand samplers then, thanks for your patience :sweat_smile:
Is there a functional difference, or difference in a user's intention, between:
Is "greedy" synonymous with "no sampling, just take the most likely next token always"?
I assumed that putting temperature in the sampler chain, and setting temperature to 0, makes the temperature sampler a pass-through no-op. I don't expect setting temperature to 0 to hijack the chain entirely (whether or not temperature is in the specified chain) to enable greedy behaviour, but perhaps there is a need for a consumer to signal they want this greedy behaviour. The way it's done right now surprises me as a user.
Is "greedy" synonymous with "no sampling, just take the most likely next token always"?
Yes, that's the way the term is used here. The logic is that lower temperatures make the sampling more deterministic, so a temperature of 0 should mean that the sampling is completely deterministic. To disable temperature you would have to set it to a value of 1. This was reasonable when the entire sampling chain consisted of top-k, top-p and temperature, but as the samplers got more complex I can understand that this can be confusing.
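The relationship described above can be sketched numerically (a hedged illustration; softmax_with_temperature is a hypothetical helper for this example, not a llama.cpp function): dividing the logits by the temperature before the softmax leaves the probabilities unchanged at temperature 1, and concentrates nearly all mass on the argmax as the temperature approaches 0.

```python
import math

def softmax_with_temperature(logits, temp):
    """Softmax over temperature-scaled logits."""
    scaled = [l / temp for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 4.0]
unchanged = softmax_with_temperature(logits, 1.0)     # identical to a plain softmax
near_greedy = softmax_with_temperature(logits, 0.05)  # mass piles onto the argmax
```

So temperature = 1 is the pass-through value, while temperature → 0 is the deterministic limit that the greedy shortcut implements exactly.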
The problem here is that temperature = 0 is the intended and established way of requesting deterministic (greedy) sampling, because some tasks need no variation, and that's exactly what zero temperature represents. It is still used in practice (in fact, in some recent papers), so I'm not even sure how to define a different convention that would not break existing rules.
To disable temperature you would have to set it to a value of 1.
Oh. So it's a temperature of 1 that is a no-op? Is it equivalent to not putting temperature in the sampler chain?
I assumed that 0 would be the no-op value.
I see now that the SillyTavern UI indeed says to set temperature to 1 "for the original probabilities" :facepalm:
Perhaps there's still merit in reconsidering the temperature == 0 greedy magic switch, but this seems to be user error on my part.
What happened?
It's possible I'm misunderstanding samplers and sampler parameters. It's also possible this is a symptom of a larger problem, where "default" values for some samplers may cause other samplers to not be activated in llama-server.
Observed behaviour
There are several sampler parameters which, when given "do nothing" default values via llama-server's /completions API, seem to cause the XTC sampler to not be used.
Update: The above should be "Given a temperature of 0, even if temperature is not in the requested sampler sequence, the XTC sampler is not used".
The following JSON payload demonstrates the issue:
In my testing, this causes the XTC sampler to not be activated. The vibe was off, and the following hacky debugging that I added was not activating:
Given the following simpler JSON payload, the hacky debugging was successfully activated:
Furthermore, each of the things after my samplers array seems to individually cause XTC to not activate. For example, a temperature of 0 (without specifying any of top_k, tfs_z, top_p or min_p) is enough to cause XTC to not activate. There may be other parameters, including sampler parameters, which cause XTC to not activate, but which I did not test.
(Update: I was wrong about this; it seems as though only temperature == 0 reproduces the issue.)
This is problematic for clients such as SillyTavern, which seem to always send all samplers in the array but which rely on sending default parameters (e.g. 0 in the case of temperature) to effectively disable them. Such a client will ~~never be able to activate XTC~~ not activate XTC if the user gives a temperature of 0 in the hopes of disabling the temperature sampler.
Expected behaviour
If XTC is in the samplers array, and xtc_threshold and xtc_probability meet the criteria for XTC to be used, XTC should be used regardless of the parameters for other samplers. More generally, if any sampler is in the samplers array and its parameters meet the criteria for it to be used, it should be used regardless of the parameters for other samplers (?)
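For concreteness, a request payload of the general shape the report describes might look like the following sketch (a hypothetical reconstruction, not copied from the original report; the field names are taken from the sampler-params log output earlier in the thread):

```python
import json

# Sampler sequence deliberately excludes temperature, yet "temperature": 0
# is still sent (as a client like SillyTavern would) in an attempt to
# disable it -- which instead routes the server down the greedy path.
payload = {
    "prompt": "You are a creative story writer",
    "samplers": ["min_p", "xtc"],  # temperature not listed
    "min_p": 0.02,
    "xtc_probability": 0.5,
    "xtc_threshold": 0.1,
    "temperature": 0,  # intended as "disabled", triggers greedy sampling
    "seed": 1,
}
print(json.dumps(payload, indent=2))
```

Under the expected behaviour above, this request should still run XTC, since xtc_probability and xtc_threshold are set and temperature is absent from the samplers array.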
Related
#9742
Name and Version
What operating system are you seeing the problem on?
Linux
Relevant log output
No response