ggerganov / llama.cpp

LLM inference in C/C++
MIT License

[Feature Request] Dynamic temperature sampling for better coherence / creativity #3483

Closed kalomaze closed 3 weeks ago

kalomaze commented 9 months ago

Prerequisites

Feature Idea

Typical sampling methods for large language models, such as Top P and Top K (as well as alternative sampler modes that decide the Top K dynamically, like Mirostat), are based on the assumption that a static temperature value (a consistently randomized probability distribution) is the ideal sampler conditioning. Mirostat, most notably, was designed to 'learn' a certain targeted level of 'entropy' over time; this helped the model find the most grammatically coherent selection of tokens for the sampler to consider. Most of these sampling implementations weren't designed to be used together. Some, like TFS, were created when the largest available models were smaller ones like GPT-2. Those models struggled a lot more when attempting to generalize in different directions, and it makes sense to me that they'd need unique sampler tricks to stay grammatically coherent.

I've tested and played around with these settings for Llama models, and while Mirostat seemed like a step in the right direction, especially for preventing repetition, I realized that nobody had made a sampler mode that controls temperature directly per token. My implementation would be based on a simple metric: take the standard deviation of all the tokens being considered by your Top P / Top K before applying the temperature randomization, and, based on the 'confidence' of the model (as represented by the variation in choice), apply a temperature adjustment proportional to the variation in probability seen across the sampled set of candidate tokens.

The main idea is to encourage randomizing 'uncertain' probabilities (e.g. open-ended writing, abstract concepts that can be represented with many words and aren't deterministic by nature) while keeping the temperature low for more deterministic tokens, without having to find the ideal selection of candidates for sampling per token (which I believe is how Mirostat was designed to work).
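Roughly, the idea could look like the sketch below (just an illustration, not tied to the llama.cpp sampler API; the scaling constant is an arbitrary knob):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch only: scale temperature by how spread out the candidate
// probabilities are. A high standard deviation means the mass is
// concentrated on a few tokens (the model is "confident"), so we move
// toward min_temp; a flat distribution moves toward max_temp.
float dyn_temp_from_stddev(const std::vector<float> & probs,
                           float min_temp, float max_temp) {
    if (probs.empty()) {
        return min_temp;
    }
    float mean = 0.0f;
    for (float p : probs) mean += p;
    mean /= probs.size();

    float var = 0.0f;
    for (float p : probs) var += (p - mean) * (p - mean);
    var /= probs.size();

    // The factor of 10 is an arbitrary tuning knob for this sketch.
    const float confidence = std::min(1.0f, std::sqrt(var) * 10.0f);
    return max_temp - confidence * (max_temp - min_temp);
}
```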

List of possible advantages could be:

List of possible disadvantages could be:

kalomaze commented 9 months ago

cc @KerfuffleV2: I think this is a more realistic sampler modification to implement compared to my last issue. Do you have any opinions on this?

Zhuyuqii commented 9 months ago

https://arxiv.org/abs/2309.02772

KerfuffleV2 commented 9 months ago

I think this is a more realistic sampler modification to implement compared to my last issue. Do you have any opinions on this?

Definitely way more practical and the use case is also clearer to me than what you were talking about before.

I also had the idea to do something similar with word boundaries. I.e. if you're generating something like "that is wh" then the temperature for tokens like en, ere, at shouldn't necessarily be the same as dog, since [wh]en, [wh]ere, etc. complete the word. Also, if you have "that is", a token like en or ere shouldn't necessarily have the same temperature as something like n't, what, etc. So it matters whether you're in the middle of a word, and whether the token under consideration would complete a word or start a new one.

kalomaze commented 9 months ago

I've been working on drafting this. Here's an interesting example: the Declaration of Independence, and the measured top-token probability for the next sentence when I gave Mistral 7B half of the first paragraph:

[image: measured top-token probabilities for the Declaration of Independence continuation]

As you can see, it is not deterministic enough with a low-ish temperature sampler config to reasonably prevent hallucinated quotations; some tokens are more like 95% or 90%, rather than the 99.9% someone I talked to theorized would be the case (and that it would only be 99.9% instead of 100% because it has to avoid dividing by zero).

Curiously, here's a natural language prompt:

[image: top-token probabilities for a natural-language prompt]

Quite bizarrely, the variance is incredibly high for natural language. Some tokens are quite undecided, going as low as 13% for their top-token probability; some are very obvious (89%) in comparison.

Standard deviation was proposed at first here to help measure 'confidence' and scale temperature accordingly, but that was more of a hunch and not necessarily the 'best' idea on how to implement the dynamic temp.

There are a multitude of ways we could measure and score how confident a model is at predicting:

I will continue to update this if I make significant progress.

KerfuffleV2 commented 9 months ago

Are those values after softmax? If not, comparing the absolute values between different runs might not really be meaningful. It's the logit value relative to other logits that determines which token gets picked, not really the absolute value.

You didn't show the code or process you used to generate that output, so it's hard to comment.

kalomaze commented 9 months ago

It was called right before temp sampling, with the other samplers (top p, etc.) disabled, but that might still not have been completely accurate; you're right on that. Though I didn't change the sampling settings between those two responses...

kalomaze commented 9 months ago

I have a test implementation of this feature hard-coded right now in this GUI fork of llama.cpp (koboldcpp): https://github.com/LostRuins/koboldcpp/pull/464

I am calling it 'greedy dynamic temperature', because I'm only taking the top token's probability and scaling the temperature based on that with an exponential curve. The curve is shaped so that high probability values like 90% and 100% both land close to the minimum temperature, while the difference between 40% and 50% is more pronounced (the lower the confidence, the closer you are to the maximum temperature). I've labeled this approach 'greedy' because it relies solely on the top token's probability for the adjustment. But that could be all we need...
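A rough sketch of what that 'greedy' scaling looks like (assuming the candidate probabilities are already softmaxed; the curve shape and the exponent k are illustrative, not the exact formula in the koboldcpp PR):

```cpp
#include <cmath>

// "Greedy" dynamic temperature sketch: only the top token's post-softmax
// probability drives the temperature. High confidence (0.9..1.0) stays near
// min_temp; the lower the confidence, the closer we get to max_temp. The
// exponent k shapes the curve and is just an illustrative tuning knob.
float greedy_dyn_temp(float top_prob, float min_temp, float max_temp, float k) {
    const float uncertainty = 1.0f - top_prob; // 0 = fully confident
    return min_temp + (max_temp - min_temp) * std::pow(uncertainty, k);
}
```

With min temp 0.1, max temp 1.5 and k = 2, for example, a 95% top token lands around 0.10 while a 40% top token lands around 0.60.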

It seems to be doing decently well so far with the provided test values (min temp 0.1, max temp 1.5) when I test its ability to continue long-form text; by that, I mean completing partial passages that LLMs have 'memorized perfectly' (things like the Declaration of Independence, as I mentioned) on a non-instruct model (just for testing). It is also handling creative / open-ended text generation properly, and I'm not seeing much repetition there.

Will do more tests and a more 'proper' implementation of it so that this is its own option and not hardcoded. Then, if it has a good reception, I will consider a PR on the main repository here. If not, I will rethink the approach of using only the top token.

KerfuffleV2 commented 9 months ago

I'm a bit confused by that code. The candidates aren't sorted until you call either llama_sample_softmax or llama_sample_top_k. Also, there are other samplers that can change the order, so the candidates will only be sorted if one of those two functions got called and no other order-changing sampler was called afterwards. There's also a candidates->sorted flag you can use to check whether they're sorted. You can't just assume they'll always be sorted. For example, when using the mirostat samplers, everything except temperature gets skipped, so the temperature sampler gets called first and then the mirostat sampler, meaning the logits won't be sorted at that point.

What I'd do is just call the softmax sampler since you want the softmax value anyway. Then you'll know the logits are sorted and candidates->data[i].p will have the softmax value.
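Something like this, for instance (a sketch against the sampling API of the time; the surrounding function is just illustrative):

```cpp
#include "llama.h"

// Sketch only: llama_sample_softmax() sorts the candidates by logit
// (descending) and fills in the .p fields with post-softmax probabilities,
// so the custom sampler doesn't need to assume anything about the order
// it receives the candidates in.
static void sample_dynamic_temp(llama_token_data_array * candidates) {
    llama_sample_softmax(nullptr, candidates);

    // candidates->data[0] is now the most likely token and .p is valid
    const float top_prob = candidates->data[0].p;

    // ... derive a per-token temperature from top_prob (or entropy, etc.)
    //     and scale the logits with it, as the temperature sampler does ...
    (void) top_prob;
}
```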

Also, from the pull:

float prob_max_token_before_temp = expf(max_l - max_l) / sum_exp;

In other words:

float prob_max_token_before_temp = expf(0) / sum_exp;

Right? max_l - max_l has to be 0 (except if it was NaN but that shouldn't be a case you run into).
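Which means the whole expression reduces to the reciprocal of the softmax denominator:

```cpp
// expf(max_l - max_l) == expf(0.0f) == 1.0f
float prob_max_token_before_temp = 1.0f / sum_exp;
```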

kalomaze commented 9 months ago

I was not intending for this to have full compatibility with the other samplers until I could confirm it was working to some degree; at that point I was going to make sure it worked in tandem with the different samplers (I'm prototyping without a very good knowledge of the general codebase). Also, I'm currently investigating a potential issue with how it scales (in terms of the curve), so that will have to wait. But thank you for pointing it out. As soon as I ensure I'm scaling the way I initially intended, I'll try to figure out how to make sure Mirostat is disabled when this is on and that softmax has been called when measuring.

Also yeah that max_l minus max_l is probably redundant

KerfuffleV2 commented 9 months ago

I'll try to figure out how to make sure Mirostat is disabled when this is on

You don't need to do that, you just need to run llama_sample_softmax rather than assuming the logits are already sorted when your sampler is reached.

In other words just add llama_sample_softmax(nullptr, candidates); after the line near the top when you start timing the sampler.

Then you'll have the logits nicely sorted and the softmax values available in .p

kalomaze commented 9 months ago

I'll try to figure out how to make sure Mirostat is disabled when this is on

You don't need to do that, you just need to run llama_sample_softmax rather than assuming the logits are already sorted when your sampler is reached.

In other words just add llama_sample_softmax(nullptr, candidates); after the line near the top when you start timing the sampler.

Then you'll have the logits nicely sorted and the softmax values available in .p

Thank you!

Also, this test is not very comprehensive and isn't the most accurate, especially with a GPT-4 judge, but the intention was to see if there were any discernible general trends, even without a lot of data (and quickly, without having to research quote origins...).

[image: test results]

However, it has useful data, even if it's a bit misleading when taken too literally (like most LLM benchmarks). The rate at which it completely makes up quotes was much higher with non-dynamic sampling (of course, both attempts produced a bunch of well-known misleading quotes, but for the purposes of what this test was benchmarking, that wasn't strictly important).

For my testing I used a pretty uninteresting Mistral finetune that typically suffers from sampling issues, because I've noticed that with a lower temp (e.g. 0.6) it starts repeating and hallucinating. I chose 0.9 here for the non-dynamic temp to avoid that. A 2.0 max temp with a k scaling value of 2 doesn't struggle with that nearly as much on the exact same model, from other anecdotal testing (again, not thoroughly benchmarked, just testing the waters here).

Here's how that scale looks on a graph:

[image: graph of the scaling curve]

And the formula for it:

[image: formula for the scaling curve]

kalomaze commented 9 months ago

The latest commit takes a different approach to the formula: it is now represented as a sigmoid function. Two presets are mapped to the temperature values '1.93' and '1.94' for now, and those values are (temporarily) overridden until a proper full implementation is put into place.

[image]

This is the basic test preset for 1.93. The 1.94 preset uses a max of 2.0 instead and scales the temperature more dramatically.

kalomaze commented 9 months ago

The test build of Koboldcpp is up: https://github.com/kalomaze/koboldcpp/releases/tag/dynamic-test

I've been getting a very positive reception so far, but the actual values used probably need better calibration.

kalomaze commented 9 months ago

[image]

Entropy sampling! My math might be wrong here in some fashion (the idea of Shannon entropy is new to me, as is C++ in general... so bear with me lol), but this implementation is working well so far in basic testing.

The concept is essentially that when there's total evenness in the probability distribution, let's say 'perfect' evenness, the temperature should theoretically scale all the way up to 2.0 (or whatever you set as maxTemp). The inverse, full confidence in just the top token with next to no variation at all, would get nearly 0 temperature.
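A bare-bones sketch of that mapping (assuming softmaxed probabilities, normalizing the Shannon entropy by its maximum, log(N), and mapping it linearly onto temperature; the actual build's curve may differ):

```cpp
#include <cmath>
#include <vector>

// Entropy-based dynamic temperature sketch. Shannon entropy of the candidate
// probabilities is normalized by its maximum possible value, log(N), giving
// a value in [0, 1]: 0 = all mass on one token, 1 = perfectly even.
float entropy_dyn_temp(const std::vector<float> & probs, float max_temp) {
    if (probs.size() <= 1) {
        return 0.0f; // a single candidate is fully determined
    }
    float entropy = 0.0f;
    for (float p : probs) {
        if (p > 0.0f) entropy -= p * std::log(p);
    }
    const float normalized = entropy / std::log((float) probs.size());

    // perfectly even distribution -> max_temp, full confidence -> ~0
    return max_temp * normalized;
}
```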

This is to avoid the pitfalls of relying on a single top token:

[image]

I haven't pushed it yet, but I'll double-check whether this implementation is better than what I came up with before or whether it needs adjustments implementation-wise.

kalomaze commented 9 months ago

[image]

A static Top K of 40 was used; no Top P (or any other sampler) was used whatsoever.

This comparison is somewhat misleading, because whether it uses the pattern Celebrity - Birthday vs. Birthday - Celebrity absolutely matters semantically in terms of how the model learned... but even so, it shows that just one bad prediction, allowed to happen because of a sensitive temperature, can eventually lead to many bad predictions and compounding failure. Also notice how it ends abruptly when using 1.0 temp.

kalomaze commented 9 months ago

[image: per-token confidence for a code generation (green) vs. an essay generation (blue)]

The green line here represents me asking it to generate Python code, with multiple stopping points made to elaborate on what the code was doing. The blue line represents an essay generation. Both were stopped at 500 tokens. You can observe that, on average, there's more certainty across the predictions for the code generation compared to the open-ended essay generation.

[image]

The model I am using is not finetuned for code and is biased toward storywriting / chat, so a better comparison would be a Code Llama model against a storywriting finetune, but you can still notice an obvious trend even on the same model with different prompts. To me, this is evidence that it's a solid metric for scaling temperature randomization.

kalomaze commented 8 months ago

[image]

This might work better for entropy sampling and would be more straightforward to adjust: there'd simply be a min and max temperature, and the very start of the curve would decay quickly. Will test this out today.

kalomaze commented 8 months ago

After some more research into how to properly score distributions where there are many 'bad candidates', I discovered that the Gini coefficient is a way to directly measure inequality in a distribution. Using that as the measurement might be superior to entropy if we want to measure overall 'uncertainty', because it weighs the disproportionately probable tokens as being more important in its scoring.

So for a theoretical distribution like

  1. 75%
  2. 2.5%
  3. 2.5%
  4. 2.5%
  5. 2.5%
  6. 2.5%
  7. 2.5%
  8. 2.5%
  9. 2.5%
  10. 2.5%
  11. 2.5%

Gini would assign a lower 'uncertainty' value to this than entropy would, because entropy cares about the sum of the lower-probability values, while Gini is biased toward the higher-probability values in the distribution when calculating its value, which is theoretically better for this use case.
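A sketch of how the Gini coefficient could be computed over the candidate probabilities (the O(n^2) pairwise form, assuming the probabilities sum to 1; the temperature mapping in the trailing comment is just one way to use it):

```cpp
#include <cmath>
#include <vector>

// Gini-coefficient sketch: measures inequality of the candidate probability
// distribution. 0 = perfectly even, approaching 1 = all mass on one token.
float gini(const std::vector<float> & probs) {
    const size_t n = probs.size();
    if (n < 2) return 0.0f;

    // Pairwise form; fine for a truncated candidate list. Assumes the
    // probabilities sum to 1 (mean = 1/n), so the denominator reduces to 2n.
    float abs_diff_sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            abs_diff_sum += std::fabs(probs[i] - probs[j]);

    return abs_diff_sum / (2.0f * n);
}

// One possible mapping: higher Gini = more concentrated = lower temperature.
// float temp = min_temp + (max_temp - min_temp) * (1.0f - gini(probs));
```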

[image]

Also, I switched to a power function, which seems reasonable and simpler for experimenting compared to a sigmoid:

[image: power-function temperature mapping]

I will be updating this page on my efforts / progress implementing dynamic temp for those interested: https://rentry.org/dynamic_temperature

kalomaze commented 8 months ago

I also had the idea to do something similar with word boundaries. I.e. if you're generating something like "that is wh" then the temperature for tokens like en, ere, at shouldn't necessarily be the same as dog, since [wh]en, [wh]ere, etc.

I'm guessing the idea here is conditional determinism? Like, whenever it starts a piece of a larger word, you might want to ensure it finishes that word with a higher degree of determinism / lower temperature rather than creating a pseudo-word. If so, that reminds me of the AdapT paper posted at the start of this issue, where they were trying to find arbitrary 'conditions' that trigger a different temperature. That does work, but I'm thinking a generalized dynamic temp would be best.

Also, I've posted another koboldcpp build where you can try out the Gini sampling approach (as well as Entropy sampling, and the original DynaTemp implementation, but Gini seems superior). I've gotten positive feedback so far.

kalomaze commented 8 months ago

[image]

I did an experiment where I turned off Top K and Top P: no samplers other than dynamic temperature, going from 0.0 temp to 2.0 temp via a linear mapping of HHI (the sum of squared probabilities, a metric used to measure the concentration of the probabilities). All 32,000 tokens were considered for the test. Strangely enough, the generations were either totally coherent and creative, or coherent for a bit and then started repeating 'nonsense'. So I measured the HHI distributions of both.

Coherent Generations:

Incoherent Texts:

It is very interesting to me how one bad, nonsensical token choice can totally break the rest of the generation. I wonder if a running HHI measurement could be kept which dials back the temperature scaling whenever it shifts too far from the mean; that could help prevent this... (it may not be worth it compared to just using Top P / Top K, but one might hope a universal sampler could exist).
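For reference, the HHI measurement here is just the sum of squared probabilities; a minimal sketch, with the linear 0.0-2.0 mapping being my reading of the setup above rather than the exact code used:

```cpp
#include <vector>

// Herfindahl-Hirschman index sketch: sum of squared probabilities.
// 1.0 = all mass on one token; 1/N = a perfectly flat distribution over N.
float hhi(const std::vector<float> & probs) {
    float sum_sq = 0.0f;
    for (float p : probs) sum_sq += p * p;
    return sum_sq;
}

// Linear mapping as described above: high concentration -> low temperature.
// float temp = max_temp * (1.0f - hhi(probs)); // max_temp = 2.0 in the test
```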

kalomaze commented 8 months ago

I'm getting very good results with my Min P sampler implementation.

https://github.com/kalomaze/koboldcpp/releases/tag/minP https://github.com/kalomaze/text-generation-webui/releases/tag/minp-exllama2

[image]

This is pretty close to the niche 'Top A' sampling method except scaled linearly, which seems a lot more appropriate considering the probability distributions I've measured. I can say with confidence that this generally isolates the best tokens for sampling more consistently than how Top P currently samples. Let me break this down:

Let's say in theory we have this distribution (assuming 1.0 temperature):

If Top P is used with a distribution like this, with a typical value such as 0.90, then in theory Top P would include most of the 1% probabilities while reaching for a total sum of 90%, making a bunch of low-quality choices quite likely. In practice this does happen, though to a less exaggerated extent; but when the chance of choosing tail probabilities occurs on every token (and is sometimes exaggerated by temperature scaling), it eventually leads to compounding failure.

Min P works differently. Let's assume my default Min P value of 0.05 (5%) is used. For the same probability distribution:

The 0.05 would be scaled by the top probability expressed as a decimal, here 0.25. So 0.05 x 0.25 = 0.0125 (1.25%), and only tokens with a probability above 1.25% would be kept.

Math-wise, it seems to handle these distributions much better on average, if we assume the goal is to cut out the tail end of the distribution in a simple and effective manner.
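A minimal sketch of that rule (assuming the candidate probabilities are already softmaxed and sorted in descending order; not the exact code in the test build):

```cpp
#include <vector>

// Min P sketch: keep only tokens whose probability is at least min_p times
// the top token's probability. With min_p = 0.05 and a 25% top token the
// threshold is 0.05 * 0.25 = 0.0125, i.e. 1.25%.
std::vector<float> min_p_filter(const std::vector<float> & probs, float min_p) {
    std::vector<float> kept;
    if (probs.empty()) {
        return kept;
    }
    const float threshold = min_p * probs[0]; // probs[0] is the top probability
    for (float p : probs) {
        if (p >= threshold) kept.push_back(p);
    }
    return kept; // caller re-normalizes before sampling
}
```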

Also, this isn't really 'evidence' so much as 'opinions from people I trust to give good subjective analysis', but I am hearing positive reports on my Min P test build for koboldcpp.

[image] [image]

This is on top of my console logging showing that the method is cutting out the tail end very reasonably across probability distributions in a similar fashion to Top P:

[image: console logging output]

I will put up a PR with the current code asking for feedback on how to properly integrate this within llama.cpp. Dynamic Temp will probably stay as an experimental side project for now, but Min P as a sampler option seems immediately relevant and useful.

Also, I would just call this a 'linear version of Top A' or something along those lines, but the problem is that search engines do not like "Top A"... so I think a rename to something more distinct is in order for the sake of accessibility. I am willing to hear other suggestions for what this could be named beyond "Min P".

[image]

KerfuffleV2 commented 8 months ago

The "cutting the tail" part sounds very similar to the tail free sampler. Have you already looked at that? (Locally typical also isn't completely different either.)

  1. https://trentbrick.github.io/Tail-Free-Sampling/
  2. https://arxiv.org/abs/2202.00666
kalomaze commented 8 months ago

The "cutting the tail" part sounds very similar to the tail free sampler. Have you already looked at that? (Locally typical also isn't completely different either.)

  1. https://trentbrick.github.io/Tail-Free-Sampling/
  2. https://arxiv.org/abs/2202.00666

I have indeed looked at TFS and Typical sampling. They did not get me the results I was looking for in this department, and how the values impacted those samplers didn't seem very interpretable, which makes them difficult to use as hyperparameters. For Typical sampling specifically, I think a big problem is that it presumes 'uncertain distributions mean the generation is becoming atypical', rather than that the text is open ended and has many valid choices.

I will admit that TFS seems quite mathematically dense and I don't fully understand it, and I didn't get good [subjective] results back when I tested different values. However, this was when the Top K clamping bug was still in koboldcpp, so I'm not sure whether that fix affects the calculation of the derivatives in any meaningful way.

This might be bias speaking, but I'm a fan of Occam's razor when it comes to sampler design: it should be somewhat intuitive what a sampler is directly accomplishing, otherwise the parameter isn't reasonably interpretable to configure for different scenarios (e.g. deterministic, creative...) and it doesn't really see adoption.

[image]

Also, consider that the rate of change seems to be a much messier metric to use on today's models compared to what existed when TFS was created (GPT-2); it seems like a less predictable metric across probability distributions on modern Llama models, because the rate of change can be erratic or smooth. (Not confident about this, could be very wrong, calculus is not my strong suit lol)

KerfuffleV2 commented 8 months ago

It should be a very fast sampler. It just does softmax + sort (most existing samplers do this also), and then the worst case is to iterate the logits once. I doubt it would even have a measurable performance impact.

kalomaze commented 8 months ago

I just tried the exllamav2 implementation and it's very lit. I need to test if it hurts generation speed but the results are almost worth it. The writing is much more creative and really comes alive.

The performance impact is not measurable; sampler math tends to be extremely lightweight. There is an exception, though: the GBNF grammar sampler has some nested recursion at the moment, and I unfortunately get a huge degradation in token generation speed when using it. (I also notice that it seems to randomly 'force' certain tokens to be chosen in 'or' conditions; perhaps a refactor is in order for that...)

[image]

KerfuffleV2 commented 8 months ago

I also notice that it seems to randomly 'force' certain tokens to be chosen

It just bans tokens that don't match the grammar, it never makes any tokens more likely.

Green-Sky commented 8 months ago

#3841 got merged. Please test :)

KerfuffleV2 commented 8 months ago

@kalomaze Just in case you're interested, I'm adding your Min-P sampler to my Rust sampling crate: https://github.com/KerfuffleV2/llm-samplers/pull/9 (with appropriate credit of course)

I've been using it lately and it seems very useful. I used to use tail-free (TFS), top-k, and top-p in combination, but now I'm able to disable those and min-p produces pretty much equivalent results.

I know you developed this independently, but BlinkDL's Top-A sampler idea is pretty similar: https://github.com/BlinkDL/RWKV-LM#the-top-a-sampling-method - it just uses a formula for the "you have to be this tall" threshold instead of a flat value. I'm not sure which approach works better.

KerfuffleV2 commented 8 months ago

When using min-p, we should always disable topk/top-P?

There's no inherent conflict. Those both run before min-P, though, and softmax runs again, so you may need to keep that in mind when tuning the min-P threshold.

kalomaze commented 8 months ago

@kalomaze Just in case you're interested, I'm adding your Min-P sampler to my Rust sampling crate: KerfuffleV2/llm-samplers#9 (with appropriate credit of course)

I've been using it lately and it seems very useful. I used to use tail-free (TFS), top-k, and top-p in combination, but now I'm able to disable those and min-p produces pretty much equivalent results.

I know you developed this independently, but BlinkDL's Top-A sampler idea is pretty similar: https://github.com/BlinkDL/RWKV-LM#the-top-a-sampling-method - it just uses a formula for the "you have to be this tall" threshold instead of a flat value. I'm not sure which approach works better.

Yup, Min P is essentially a linear Top A. Looking at actual distributions, I think it makes more sense to just linearly scale a 'floor' (required probability) based on the 'ceiling' (top token probability). It's more directly interpretable that way (e.g. 0.25 Min P means 'you must have at least 1/4 the probability of the top token').
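For comparison, the two floors side by side (Top A as I understand it from BlinkDL's readme, with a quadratic term; the constants below are just illustrative defaults):

```cpp
// Both floors are applied to post-softmax probabilities; p_max is the top
// token's probability. Constants are illustrative, not recommendations.
inline float top_a_floor(float p_max, float top_a) { return top_a * p_max * p_max; } // quadratic
inline float min_p_floor(float p_max, float min_p) { return min_p * p_max; }         // linear

// p_max = 0.25: top_a_floor(0.25, 0.2) = 0.0125   min_p_floor(0.25, 0.05) = 0.0125
// p_max = 0.50: top_a_floor(0.50, 0.2) = 0.05     min_p_floor(0.50, 0.05) = 0.025
```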

I think it's fair to say that Min P outperforms TFS / Top P; TFS relies on the rate of change, which can be rocky on modern models, and Top P doesn't consider the possibility that the probability mass is split between a few good choices amongst a sea of bad ones...

pacmanincarnate commented 7 months ago

As we reduce the pool of tokens (such as with min-p), we also increase the likelihood that the top token will be selected. I think there's an issue with this: we only really get options when we have a large token pool, but then there's a higher likelihood that the selected token will be of much lower quality. Ideally we want a small pool of relatively high-probability tokens and a higher likelihood that a non-top token will be selected. Temp helps with that, but not necessarily at the same rate that changes in pool size affect it, so right now there's not a lot of control over the final token likelihood.

So, I think we need to combine something like min-P with something that dynamically adjusts temp based on some combination of pool size and something like the difference between the top token's probability and the next token's probability.

thoughts?

KerfuffleV2 commented 7 months ago

It sounds like you kind of want to take min-p and then normalize the top X tokens so they have about the same priority as the top one? Or at least so the probabilities are closer?

Green-Sky commented 7 months ago

To me that sounds like doing a softmax after sampling and then maybe some >1.0 temperature

pacmanincarnate commented 7 months ago

Kerfuffle, yes essentially. I don’t think we want them to be equivalently likely as that is likely to lead to craziness, but we want the curve flattened quite a bit.

Once we’ve min-p’d the list, we should be relatively happy with the selection. I feel like, in reality we never really need more than the top handful of tokens as options but we need those tokens to be fairly possible to choose between.

PhorKoz, do you find that method gives you a reasonable chance of not selecting the top token without reducing its likelihood too much? I'm not sure if the dynamic temp sampling method got moved forward at all or was dropped when Kalomaze moved on to min-p.

DutchEllie commented 6 months ago

@kalomaze You have since implemented this in koboldcpp, right? Can you upstream that?

Green-Sky commented 6 months ago

@DutchEllie what specifically? min-p is in master. (and used by default) https://github.com/ggerganov/llama.cpp/blob/8c5833031857c9e9ada61948bae894ab9c785f86/common/sampling.h#L17

DutchEllie commented 6 months ago

@DutchEllie what specifically? min-p is in master. (and used by default)

https://github.com/ggerganov/llama.cpp/blob/8c5833031857c9e9ada61948bae894ab9c785f86/common/sampling.h#L17

Kalomaze introduced min-p, but he also recently provided a DynaTemp implementation in koboldcpp (a fork of this repo). From what I hear it's quite good, so it would be nice if it could be merged upstream here.

igorbarshteyn commented 4 months ago

Would it be possible to add @kalomaze's Cubic Sampling with Curve Params that he put up in text-generation-webui? I'm hearing people are getting good results: https://github.com/oobabooga/text-generation-webui/pull/5551

joshknnd1982 commented 3 months ago

In the latest "llama cpp" how do I use the "--dynatemp-range N dynamic temperature range (default: 0.0, 0.0 = disabled)" parameter? Suppose I want a range between 0 and 1? Is the correct format --dynatemp-range 0,1 ? I'm a bit confused as to how to use this with llama cpp command line and with batch files.

NeedsLoomis commented 2 months ago

In the latest "llama cpp" how do I use the "--dynatemp-range N dynamic temperature range (default: 0.0, 0.0 = disabled)" parameter?

From the source, it appears to be a single +- value. I assume it would work like:

--temp 0.7 --dynatemp-range 0.3

That should give a range of 0.4 - 1.0

l3utterfly commented 2 months ago

Yes, that's exactly how it works

ZoomRmc commented 2 months ago

I'm a bit confused as to how to use this with llama cpp command line and with batch files.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 14 days since being marked as stale.