ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Making perplexity "comparable" between models with different vocab: per-byte perplexity #7111

Closed turian closed 2 months ago

turian commented 4 months ago

Feature Description

Instead of computing per-token PPL, llama.cpp should compute per-byte (or per-character) PPL.

For example, Llama 2 has 32K vocab and Llama 3 has 128K vocab. Llama 3 per-token perplexity scores are much higher, but that's because a Llama 3 token predicts more underlying (original, pre-tokenization) bytes.

Thus, instead of computing the per-token Perplexity (or per-token NLL), it's much more general and comparable to compute the per-byte Perplexity (or per-byte NLL).

Motivation

@bloc97 points out in a large Llama 3 vs Llama 2 PR #6936 that perplexity (PPL) computations aren't comparable for different models: "Just a word of caution: comparing perplexities across models with different token dictionary sizes is meaningless because a bigger dictionary means each token is "harder" to predict, resulting in a naturally higher PPL. Also since the total amount of tokens for a given prompt is also different between Llama 2 and 3, the running average PPL is meaningless too."

And I, who have been doing this stuff for a while (see my Google Scholar), found the non-comparable nature of PPL scores across different models surprising.

Possible Implementation

I propose that perplexity.cpp change the measure name to bytePPL or bPPL so people adopt a new name for the new measure and don't compare apples to oranges (which is what this issue is about). Also, perhaps PPL in the documentation should be changed to tokPPL to be explicit that these are the old, not-so-comparable scores.

In `static results_perplexity perplexity(...)` we need to pass in information about the tokenizer to recover the bPPL. I don't LOVE the idea of passing in the original text length. That would be nice, but the windowing code is very fiddly, so it's hard to predict how many text characters were actually used given how window truncation happens. Changing the stride makes the problem even worse. It's hard for me to grok entirely what this method is doing in all sections, e.g. why there's a triple-nested loop over n_chunk, n_batches, and n_seq_batch, and then another loop over n_seq_batch inside the n_chunk loop. If we went this route, we would just update `count` using text length rather than number of tokens.

I think the easiest, and perhaps more didactic and easier-to-flag-off (to replicate the OLD behavior), approach would be to compute #bytes / #tokens over the entire text and pass this bytes_per_token to perplexity. Then, everywhere we currently divide by count, we instead divide by count * bytes_per_token (i.e. by the byte count of the scored text), or use a nicely named subroutine.

I'm open to both approaches, though.
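
For concreteness, here is a rough sketch of the bytes_per_token approach; the names (report_ppl, total_nll, etc.) are illustrative only, not the actual perplexity.cpp symbols:

```cpp
#include <cmath>
#include <cstdio>

// Sketch only: convert an accumulated negative log-likelihood (in nats) into
// per-token and per-byte perplexity. Dividing by count * bytes_per_token is
// the same as dividing by the byte count of the scored text, which is what
// makes the result independent of the tokenizer.
static void report_ppl(double total_nll, int count, double bytes_per_token) {
    const double tok_ppl  = std::exp(total_nll / count);                      // current tokPPL
    const double byte_ppl = std::exp(total_nll / (count * bytes_per_token));  // proposed bPPL
    printf("tokPPL = %.4f, bPPL = %.4f (bytes/token = %.3f)\n",
           tok_ppl, byte_ppl, bytes_per_token);
}
```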

turian commented 4 months ago

Let me also note:

This is a BIG step in the right direction toward comparability between models with different vocab sizes. But it is not perfectly comparable across models with different vocab sizes (or a different n_ctx than perplexity.cpp's default of 512, or a different ppl_stride than perplexity.cpp's default of n_ctx/2).

This is because of the optimizations perplexity.cpp uses: it works on n_ctx-token windows, strides windows by ppl_stride (default n_ctx/2), and only uses the logits from the second half of each window (conditioned on the first half) to compute the NLL and perplexity.

What this means is that if you change n_ctx, the vocab size, or other windowing parameters, the precise part of the text used for prediction + scoring, and its conditioning text, can change.

I think this effect will be relatively minor, compared to the tokPPL vs bPPL fix I propose above.

The ideal would be the approach described in the HuggingFace perplexity article that llama.cpp's README links to: predicting the NLL of each token given the full (n_ctx - 1)-token preceding context is the closest (and perhaps ideal) way to compute the true perplexity. However, this is SLOW because every token's logits must be computed with a full context. But it would sample the document in the most comparable way.
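
For illustration, here is a rough sketch of the two scoring strategies being discussed. This is not the actual perplexity.cpp code; the NllFn callback stands in for whatever model evaluation is used:

```cpp
#include <cstddef>
#include <functional>

// nll_of(ctx_begin, pos) = NLL of token `pos` given tokens [ctx_begin, pos).
// How it is evaluated (model calls, batching) is left abstract here.
using NllFn = std::function<double(std::size_t ctx_begin, std::size_t pos)>;

// Strategy A (roughly what perplexity.cpp does, as I understand it): slide a
// window of n_ctx tokens by n_ctx/2 and score only the second half of each
// window, conditioned on the first half. One forward pass scores n_ctx/2
// tokens, but each scored token sees a different context length
// (from n_ctx/2 up to n_ctx - 1).
double half_window_mean_nll(std::size_t n_tokens, std::size_t n_ctx, const NllFn & nll_of) {
    double nll = 0.0;
    std::size_t count = 0;
    for (std::size_t start = 0; start + n_ctx <= n_tokens; start += n_ctx / 2) {
        for (std::size_t i = n_ctx / 2; i < n_ctx; ++i) {
            nll += nll_of(start, start + i);
            ++count;
        }
    }
    return nll / count;
}

// Strategy B (the HF-article "ideal"): every token is scored with the full
// preceding n_ctx - 1 tokens of context; one forward pass per token, so it
// is far slower.
double full_context_mean_nll(std::size_t n_tokens, std::size_t n_ctx, const NllFn & nll_of) {
    double nll = 0.0;
    std::size_t count = 0;
    for (std::size_t pos = 1; pos < n_tokens; ++pos) {
        const std::size_t ctx_begin = pos >= n_ctx - 1 ? pos - (n_ctx - 1) : 0;
        nll += nll_of(ctx_begin, pos);
        ++count;
    }
    return nll / count;
}
```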

I think it's fine overall that there are different ways of sampling the underlying corpus because they are relatively unbiased. Only for short documents with very few windows would the scores not be comparable, or if the context size is really large compared to the document length (which is increasingly true with 128K context models, etc). I include this discussion because I think the perplexity/README.md needs an update after the most recent PR with information from this comment and the original post.

[edit: the half-window-score approach is worth discussing if we truly want to future-proof perplexity, i.e. if we really want to compare models with wildly varying context sizes. It would probably be best to move to predicting a fixed number of characters or tokens, like 32. I can discuss more if people are interested.]

turian commented 4 months ago

@JohannesGaessler @ikawrakow I know you are both interested in perplexity discussions, so I thought I would share this.

JohannesGaessler commented 4 months ago

I don't LOVE the idea of passing in the original text length.

You don't have to. All you would have to do is convert the token ids back to strings using llama_token_to_piece (see common/common.h) and then you can simply sum up the lengths of the resulting pieces over the token vector.
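
For example, something along these lines should work (a sketch, assuming the llama_token_to_piece helper in common/common.h keeps its current shape):

```cpp
#include <string>
#include <vector>

#include "common.h" // assumed: std::string llama_token_to_piece(const llama_context *, llama_token)

// Sketch: recover the byte length of the text a token window covers by
// detokenizing each token and summing the piece lengths.
static size_t window_byte_length(llama_context * ctx, const std::vector<llama_token> & tokens) {
    size_t n_bytes = 0;
    for (const llama_token tok : tokens) {
        n_bytes += llama_token_to_piece(ctx, tok).size();
    }
    return n_bytes;
}
```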

In any case, do you have any theoretical or empirical evidence that suggests that normalizing PPL to text length leads to values that are actually comparable?

turian commented 4 months ago

You don't have to. All you would have to do is convert the token ids back to strings using llama_token_to_piece (see common/common.h) and then you can simply sum up the lengths of the resulting pieces over the token vector.

I am not sure. Are all tokenizers such that if you take a truncated window (the second half of a window) and detokenize those tokens, the result forms valid text?

In any case, do you have any theoretical or empirical evidence that suggests that normalizing PPL to text length leads to values that are actually comparable?

Measuring per-token perplexity is buggy, and we know that through theoretical arguments (like mine and @bloc97's) and empirical evidence (your PR).

It is difficult to say whether converting to per-byte (or per-character, if we want to be character-encoding agnostic) would completely fix the "different vocab sizes means perplexity is not comparable" problem, since I believe comparing different window lengths is also an important component of model evaluation.

What is clear from a long body of work is that compression and AI are very intimately linked.

The best recent work is Li et al. 2024, "Evaluating Large Language Models for Generalization and Robustness via Data Compression". Some quotes:

Compression metrics are widely used in benchmarking language modeling (Radford et al., 2019; Dai et al., 2019) and are shown to be strongly correlated with generalization ability and in-context learning performance (Delétang et al., 2023; Rae, 2023).

1) Models’ compression performance over time correlates closely with their training data collection time, with clear divergences after the cutoff date. 2) Models with similar performance on the training period can demonstrate widely different generalization results on new, unseen data. 3) Generalization difficulty varies across test datasets. Models struggle to generalize on wikitext, news, and code, but generalize well on arXiv content. 4) All models fail to compress multi-modal data, showing limited capabilities in dealing with raw byte streams. 5) Larger contexts generally lead to better performance but do not exceed a small context + sliding window approach. 6) Models with larger vocabularies in tokenization have more difficulty with token-level prediction.

It is well established that compression is essentially prediction, which effectively links compression and language models (Delétang et al., 2023). The source coding theory from Shannon’s information theory (Shannon, 1948) suggests that the number of bits required by an optimal entropy encoder to compress a message ... is equal to the NLL of the message given by a statistical model.

So the point is: the NLL of a message = data compression = generalization.

Perplexity is NLL normalized by message/document length so that model scores can be compared for different document lengths too, not just for the same document. So it should be normalized not by tokenization (which is a modeling choice) but by the actual document length, i.e. bytes or characters.

Per-character perplexity and the bpc (bits per character) compression ratio are mathematically interchangeable. (This is again the per-character vs. per-byte discussion.) It might actually be most interesting to measure LLMs using bpc rather than perplexity, just because bpc is easier to understand intuitively. The exponentiation in perplexity can also make a difference look qualitatively large when it corresponds to what looks like a very small drop in bpc. Note that bpc can be computed in closed form from the ideal code length, without actually implementing arithmetic coding as in this article.
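
For reference, the closed-form relationship (writing $\mathrm{NLL}_\text{total}$ for the total negative log-likelihood in nats and $N_\text{char}$ for the character count of the scored text):

$$
\mathrm{bpc} = -\frac{1}{N_\text{char}} \sum_i \log_2 p(t_i \mid t_{<i}) = \frac{\mathrm{NLL}_\text{total}}{N_\text{char} \ln 2} = \log_2 \mathrm{PPL}_\text{char}, \qquad \mathrm{PPL}_\text{char} = 2^{\mathrm{bpc}}
$$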

Actually, I started digging into perplexity because of all the discussions of leaderboard manipulation. I started with a similar thesis and later discovered this work: an LLM generalizes well to the extent that it is less "surprised" by what happens in the near future after its knowledge cutoff date. Just as an expert ML person would be less surprised than a junior ML person by the conference results published in the next few months, because they can better generalize their knowledge of the field to predict the future.

I'll also share another work I found that is less relevant to this discussion: Wang et al. 2024, "Perplexity by PLM Is Unreliable for Evaluating Text Quality". They examine measuring an LLM's quality via the perplexity of the LLM's output (not of held-out text). This is gameable for many reasons, doesn't indicate relevancy to the prompt, etc., versus the idea of seeing how perplexed an LLM is by a held-out document or, better yet, by developments in science, economics, politics, art, etc. that happened after its knowledge cutoff date. I did appreciate Wang et al. (2024)'s work for thinking about important empirical conditions to test, like document length and punctuation style.

Anyway, going back to your big question of how to make a score's values comparable across different conditions (vocab size, context size, etc.): the first work (Li et al. 2024) demonstrates a rigorous approach to this question. Like any sort of debugging:

1) If you find a bug and you have a reasonable fix, you apply it. In this case, we know that per-byte or per-character (but not per-token) perplexity directly corresponds to compression ratio, so the bug can be directly fixed.
2) You find places where the measure contradicts prior knowledge or intuition. For example, Li et al. 2024 start with the intuition that things further in the future from the knowledge cutoff should have worse scores, and they find this to be true in their experiments (Figure 1). But if you observe a measure contradicting some strong prior you have, that's a point to dig into.
3) You compare to other benchmarks. Other benchmarks might have issues like being expensive to construct or biased (the worst being used as training data), but they are still circumstantially important. E.g. if the new measures 1 and 2 I'm proposing correlate well with benchmarks except under certain conditions, then we examine those conditions to understand what is truly going on.
4) You imagine ways to break the measure. E.g. I have the intuition that longer-context models should usually perform better than same-family models with less context, and I imagine this intuition could be violated because of the half-windowing estimation of perplexity. This attack could be tested: if the intuition violation happens, further investigation is needed; if not, the measure and its windowing approach stand for now.

JohannesGaessler commented 4 months ago

Measuring per-token perplexity is buggy, and we know that through theoretical arguments (like mine and @bloc97's) and empirical evidence (your PR).

I don't see an issue with the values themselves, only with their interpretation.

In any case, thank you for the high-quality post. I'll look into an implementation.

turian commented 4 months ago

@JohannesGaessler The issue is that people compute Llama 2 and Llama 3 perplexities and don't realize the comparison between the two numbers is broken, because most people (yourself and myself included, up until a day ago) are not aware of this bug. And silent bugs are the worst.

Literally, right now perplexity is really only truly useful for models with the same vocab and the same context. So if we put a big warning in the README files and say that, then okay. Like showing perplexity for different quants. But even this isn't great because you start remembering wikitext perplexity numbers and using them as a yardstick on newer models like Llama 3.

Last week I ran perplexity on a wide variety of models, similar to Li et al. 2024 but with more models. The results were sometimes complicated and counterintuitive. I spent the week trying to figure things out, only to realize: 1) there is a recent tokenization bug in llama.cpp from a week ago that makes old GGUF files tokenize incorrectly, and 2) per-token perplexity is wrong for comparing models with different vocabulary sizes. Bug 2 might have taken me MONTHS to figure out if I hadn't read that small comment at the end of your PR. And that's not even getting into the interesting questions about the real underlying reasons why model X counterintuitively scores better than model Y. That's just debugging.

Basically, my argument is that scoring measures are such a fundamental tool for model builders, that it's worth putting the time in to get them right and remove as many footguns as possible.

ggerganov commented 4 months ago

Pt. 2 is not a bug - it's how perplexity is defined. I thought it was obvious that one should not compare different vocabs, but I guess we can add a warning in the README

JohannesGaessler commented 4 months ago

@ggerganov I recently revised the perplexity README and explained the metrics; it explicitly mentions that perplexity is not comparable for different tokenizers.

JohannesGaessler commented 4 months ago

Literally, right now perplexity is really only truly useful for models with the same vocab and the same context. So if we put a big warning in the README files and say that, then okay.

The README currently reads:

The perplexity example can be used to calculate the so-called perplexity value of a language model over a given text corpus. Perplexity measures how well the model can predict the next token with lower values being better. Note that perplexity is not directly comparable between models, especially if they use different tokenizers. Also note that finetunes typically result in a higher perplexity value even though the human-rated quality of outputs increases.

turian commented 4 months ago

The perplexity example can be used to calculate the so-called perplexity value of a language model over a given text corpus. Perplexity measures how well the model can predict the next token with lower values being better. Note that perplexity is not directly comparable between models, especially if they use different tokenizers. Also note that finetunes typically result in a higher perplexity value even though the human-rated quality of outputs increases.

Thanks for that. I had been mainly looking at the main README.

Pt. 2 is not a bug - it's how perplexity is defined. I thought it was obvious that one should not compare different vocabs, but I guess we can add a warning in the README

@ggerganov I think a warning in the main README is important for the following reasons: 1) I don't think it's that commonly known. @JohannesGaessler, who worked extensively on the perplexity code, didn't appear aware at the beginning of his PR that Llama 3 and Llama 2 perplexities were not directly comparable. (This is just a guess, @JohannesGaessler correct me if I'm wrong.) 2) My personal experience is that if I pick up a new measure, read about it in a variety of places, and work with it for a week without discovering important issues with it, then that knowledge isn't common. 3) In discussions of perplexity, this gotcha / caution is not often pointed out. The HF article goes over a few different gotchas but not this one.

According to a strict interpretation, yes, perplexity has historically been measured on a per-token basis, and it has been widely used in language modeling for over 30 years. With that said, historically almost all work used the same tokenization. It hasn't been until the past two or three years, with LMs advancing so quickly, that tokenization has become an important experimental variable that must be controlled for.

Per-token perplexity is historical only because tokenizers stayed fixed over five- or ten-year experimental cycles. That's no longer the case, and for that reason I think the score needs revisiting. This is also suggested by recent work on evaluating evaluation, like Li et al. 2024.

JohannesGaessler commented 4 months ago

@JohannesGaessler who worked extensively on the perplexity code didn't appear aware at the beginning of his PR that Llama 3 and Llama 2 perplexities were not directly comparable. (This is just a guess, @JohannesGaessler correct me if I'm wrong.)

I was most definitely aware of it.

JohannesGaessler commented 4 months ago

More generally, I'm not convinced that adding a warning to the main repository README is going to be of more use than a warning in examples/perplexity - that is where I would look for documentation about perplexity in particular. And the people that would need to see the warning the most are not the type of people to carefully read READMEs anyways.

turian commented 4 months ago

@ggerganov Let me know if I'm off-base here by expanding the scope of the discussion a bit. My sense is that this project is interested not just in implementation, but also innovation in the LLM space.

More generally, I'm not convinced that adding a warning to the main repository README is going to be of more use than a warning in examples/perplexity -

I guess I just don't see this gotcha noted in most places that perplexity comes up (the perplexity/README.md aside).

I was most definitely aware of it.

Okay.

I guess I was aware of this too, but I had forgotten about it. When I learned about perplexity 20 years ago, in almost every paper people were innovating on the model or the training data, not the tokenization approach. So coming into the LLM space, I had forgotten that the per-token issue is a gotcha. With that said, I think the per-token perplexity is a gotcha for many people, and I'll explain more.

that is where I would look for documentation about perplexity in particular. And the people that would need to see the warning the most are not the type of people to carefully read READMEs anyways.

Look, I know what perplexity is, and I also know that people implement evaluation measures in non-standard ways for optimization reasons. I skimmed the README.md and didn't (and still don't) see any discussion of the key implementation details that make perplexity.cpp non-standard. So I didn't read the README.md because I skimmed it and didn't see what I was looking for and decided to move onto the code.

(cc @ggerganov ) What I found is that the llama.cpp implementation of perplexity is not standard. Perplexity should be measured one token at a time, and a deviation from that or approximation of that should be discussed. I intuited from watching the output of perplexity.cpp that it wasn't doing this. From the code, not the README, I saw that it does batching/windowing in some way and, crucially, that the second half of the window is used for scoring, using the first half of the window for context. This stuck out for me as potentially incorrect and something to think about. I now believe it might be more biased than I originally thought. If I am not mistaken, token 128 uses tokens 0..127 for context. token 129 uses tokens 0..128 for context. Etc. So each of the tokens in the right half gets a different context length.

This leads me to the issue that perplexity is a confusing measure because: implementation details like windowing are really important but not often discussed, and people just post "perplexity" scores without even saying which implementation they used. Are llama.cpp perplexity scores the same or comparable to perplexity scores that are measured one-token at a time? TBH this is more the sort of information I would expect in perplexity/README (what's peculiar about this implementation), which I didn't see when I skimmed it.

@JohannesGaessler returning to when you wrote:

I was most definitely aware of it.

Okay. You were aware that per-token perplexity is not comparable between models of different vocab sizes, in theory. But then in practice you, knowing this, made a table mixing non-comparable perplexity scores, and in general compared per-token scores interleaved across non-comparable models:

[screenshot: the PR's LLaMA 2 vs. LLaMA 3 comparison table]

Why?

And then you wrote:

I definitely agree when it comes to direct LLaMA 2 vs. LLaMA 3 comparisons. I think you can still extract some useful information by comparing the change in metrics towards smaller quant formats though; I think it's undeniable that for example the q2_K quality loss is much more severe for LLaMA 2 vs. LLaMA 3 (based on the token probability metrics). [my emphasis]

But why? Why are you comparing per-token probability metrics for two models with different vocab sizes? Why do you expect that to be sound?

This gets confusing because people get familiar with a score under baseline conditions. Then Llama 3 comes out with a vastly different vocab size and people start trying to read the tea leaves, assuming that a relative change of 1.5x in PPL for Llama 3 is just as bad as for Llama 2. That sort of assumption scares me and I don't think it is justified. We can't compare the scores between Llama 2 and Llama 3, but we can compare the relative difference in scores between Llama 2 and Llama 3? That's quite a stretch.

The analogy I'd make is Fahrenheit and Celsius. Most people know what they both are in theory, but not in their bones. If you live in a C-degree place, you will stumble and intuit incorrectly when people talk about a 5-F-degree drop in temperature. And that's just a linear transformation between F-degrees and C-degrees that is difficult to reason about! Whereas with perplexity we're talking about log and exp transforms in several places.

Another proposal could be that we just start outputting the vocab size, at the very least, before perplexity. Like call it 32k-perplexity. Just like you wouldn't make a table with columns: New York, Berlin and row: degrees and fill it in with F-degrees for New York and C-degrees for Berlin. You would call attention to the fact that the scores are not comparable by having "F-degrees" in one row, "C-degrees" in another row, and blanks where you didn't have the value. Similarly, "32k-perplexity" and "128k-perplexity" values shouldn't really be compared in the same table. And per-token numbers shouldn't be compared for models of different vocab sizes, especially when trying to make statements like: "the q2_K quality loss is much more severe for LLaMA 2 vs. LLaMA 3 (based on the token probability metrics)". I mean I believe (but am not confident) it's more severe. Is it actually much more severe? IDK and I'm not sure any of us do. Particularly since your work was trying to nail down people's qualitative experience into something quantitative, it's important to be rigorous about the quantification. Otherwise, the inferences drawn based upon quantitative measures are worse than data we openly acknowledge to be more qualitative.

Look, getting good evaluation measures is hard but important. Because they tend to be so counterintuitive, a field will latch on to one measure that is as comparable as possible and learn about that measure really deeply through a variety of experiments. But if the intuitions we draw from 32k-perplexity may incorrectly generalize to 128k-perplexity, then perhaps we gain more from trying to standardize perplexity comparisons.

With that said, maybe this issue is the wrong place to discuss innovating on LLM evaluation measures. But I still would say: per-token perplexity should be deprecated because it's not comparable in many ways, leads people to make strange assumptions, and interferes with people's ability to reason correctly about scores. (Or we should report "vocabsize-perplexity" scores as the simplest solution to flag the comparison smell.) It even trips up experienced people like yourself, who put non-comparable perplexity scores in the same table row as if they were comparable, or compare per-token score differences across models with different vocab sizes to draw inferences. Per-token perplexity was important historically, but with recent LM innovation also including fast-moving changes to tokenization, tokenization is no longer a controlled variable and should factor into our choice of scoring method.

If we want to switch to something standard, it would be bpc. But I do think it would be interesting to do per-character/per-byte perplexity, and see also how the half-window versus per-token scores compare for a variety of context sizes.

turian commented 4 months ago

I just verified that my memory of how the field moved was true. Basically in the 90s and 2000s tokenization was relatively standardized across NLP tasks, which was essentially word-based. The Penn Treebank (PTB) and Brown Corpus were common datasets used with fixed tokenizers. Most research focused on improving N-gram models or training data.

The 90s was mainly about introducing N-gram models, improving estimates with smoothing, giving them more training data.

The 00s saw the introduction of neural language models (Bengio 2003) which still used standard tokenization.

But for maybe 25 years, it was basically all word-based tokenization. Tokenization was relatively standard, it really just came down to what threshold long-tail infrequent words you omitted from your vocabulary, so tokenization didn't vary much from paper to paper. It wasn't until more recently that BPE and sentence-piece models got introduced that tokenization approaches started to differ very significantly, which is further compounded by the fact that now vocab size can vary wildly. This wasn't really the historical context in which per-token perplexity was devised as a scoring measure. We knew it was per-token, but we also just assumed token = word. (Okay not 100% true. It was considered kooky and wild and worth noting when something was doing per-character based language modeling, like Elman 1990 or Sutskever et al., 2011. So when the tokenization was really different, it was a point of interest.)

Historical note: Papers with per-word perplexity would discuss where they got their vocabulary and what frequency of rare word would be considered OOV (out-of-vocabulary). BPE was devised in 1994, but I don't think it was used in NLP until 2015, when "Neural Machine Translation of Rare Words with Subword Units" by Sennrich et al. explicitly set out to solve the problem of OOV words. In 2018, BPE became more popularized with the BERT paper. So what's funny here is that the key issue with old per-word perplexity scores (the variability of OOV handling from one work to another) is what led to experimentation with sub-word tokenization approaches.

JohannesGaessler commented 4 months ago

I intuited from watching the output of perplexity.cpp that it wasn't doing this. From the code, not the README, I saw that it does batching/windowing in some way and, crucially, that the second half of the window is used for scoring, using the first half of the window for context. This stuck out for me as potentially incorrect and something to think about. I now believe it might be more biased than I originally thought. If I am not mistaken, token 128 uses tokens 0..127 for context. token 129 uses tokens 0..128 for context. Etc. So each of the tokens in the right half gets a different context length.

This leads me to the issue that perplexity is a confusing measure because: implementation details like windowing are really important but not often discussed, and people just post "perplexity" scores without even saying which implementation they used. Are llama.cpp perplexity scores the same or comparable to perplexity scores that are measured one-token at a time? TBH this is more the sort of information I would expect in perplexity/README (what's peculiar about this implementation), which I didn't see when I skimmed it.

I intentionally did not write about those particular things because I don't want to invest the effort to keep this information up to date and outdated documentation is worse than no documentation at all. I am simply not that interested in comparing llama.cpp against other frameworks. If you want to do that using something like Oobabooga where the exact same code can be used for tokenization and perplexity calculation is probably a better choice.

Okay. You were aware that per-token perplexity is not comparable between models of different vocab sizes, in theory. But then in practice you, knowing this, made a table mixing non-comparable perplexity scores, and in general compared per-token scores interleaved across non-comparable models:

Yes, because I'm making these tables for people that know how to interpret the values. As long as the reader knows what they're looking at more information literally has no downsides. And if one wanted they could then for example take the otherwise not comparable per-token perplexity values and normalize them to the length of the input text instead by just applying a factor.

But why? Why are you comparing per-token probability metrics for two models with different vocab sizes? Why do you expect that to be sound?

I was comparing mean Δp of a quantized model vs. the FP16 model. Even though the vocabulary is different and therefore makes correctly predicting the next token not equally difficult for LLaMA 2/3 we know that two different quantization formats of the same model are given the exact same problem. For the same text LLaMA 3 q2_K made ~3.4x more mistakes per token relative to its FP16 version than LLaMA 2 did. Even if you normalize that to the length of the text instead that is still only a factor of ~1.16x and does not explain this discrepancy. The chunks of course do not align with different tokenizers but I really don't think that this is it either.

I don't remember to have made any claims about the perplexity values for LLaMA 2 vs. LLaMA 3.

turian commented 4 months ago

I intentionally did not write about those particular things because I don't want to invest the effort to keep this information up to date and outdated documentation is worse than no documentation at all.

I can understand not wanting outdated documentation. But does the windowing and scoring strategy of perplexity.cpp change so often? When it does change, if it affects the results (which I suspect it does), then some announcement or versioning should be made on the README or Changelog or similar. Otherwise people might be comparing different scoring methods? Or is this project just so fast-moving that this sort of breakage is expected across versions? I am, of course, aware that month to month the generation might change.

[edit: will you leave stale perplexity tables in when the scoring changes? and just edit the README and say: "these scores are old"? Just curious what the plan is for maintaining them]

I am simply not that interested in comparing llama.cpp against other frameworks. If you want to do that using something like Oobabooga where the exact same code can be used for tokenization and perplexity calculation is probably a better choice.

It's not really about llama.cpp versus other frameworks. The issue of half-window scoring is about researchers being able to safely publish llama.cpp perplexity scores and call them "perplexity" scores and not "llama.cpp b2797 perplexity" scores. Because a) llama.cpp perplexity may or may not correlate well with standard last-token perplexity, and b) llama.cpp won't document any deviations from standard evaluation, so you have to look up the perplexity.cpp implementation at a particular commit hash to understand what is being reported.

If I may ask, what is your interest in llama.cpp perplexity scores?

Yes, because I'm making these tables for people that know how to interpret the values. As long as the reader knows what they're looking at more information literally has no downsides.

The table is hard to read because you interleave the Llama 3 and Llama 2 scores in your PR. That is in fact a downside.

You did and now I do know how to interpret the values.

And you label the table "LLaMA 2 vs. LLaMA 3 comparison". So what are we supposed to be comparing?

And if one wanted they could then for example take the otherwise not comparable per-token perplexity values and normalize them to the length of the input text instead by just applying a factor.

Right, that's what I think the default method of reporting perplexity scores should be. Similar to how scores that evaluate text of different lengths usually normalize against the length, rather than just reporting a raw score and letting other people do the work downstream.

I was comparing mean Δp of a quantized model vs. the FP16 model. Even though the vocabulary is different and therefore makes correctly predicting the next token not equally difficult for LLaMA 2/3 we know that two different quantization formats of the same model are given the exact same problem. For the same text LLaMA 3 q2_K made ~3.4x more mistakes per token relative to its FP16 version than LLaMA 2 did. Even if you normalize that to the length of the text instead that is still only a factor of ~1.16x and does not explain this discrepancy. The chunks of course do not align with different tokenizers but I really don't think that this is it either.

Yes cool, this is the sort of detail I like to be discussing. Thank you for sharing. That is indeed a jump and worth understanding.

So you found that llama 2 has 2.93x more tokens than llama 3?

How did you get your numbers above? Of 3.4x more mistakes per token with llama 3 q2_K? In your table the mean delta p is reported as -9.123 ± 0.051 %. Thank you for explaining, I appreciate it.

I don't remember to have made any claims about the perplexity values for LLaMA 2 vs. LLaMA 3.

No, I was talking about per-token probability comparisons you made, which I still think are more interesting when transformed into per-character scores.

Also, digging into your README, why do you say "Note that perplexity is not directly comparable between models, especially if they use different tokenizers."? I don't know if that is true in general. If the statement were rewritten to the stronger: "Perplexity is directly comparable between models unless they use different tokenizers." would you agree or disagree?

JohannesGaessler commented 4 months ago

I can understand not wanting outdated documentation. But does the windowing and scoring strategy of perplexity.cpp change so often? When it does change, if it affects the results (which I suspect it does), then some announcement or versioning should be made on the README or Changelog or similar. Otherwise people might be comparing different scoring methods? Or is this project just so fast-moving that this sort of breakage is expected across versions? I am, of course, aware that month to month the generation might change.

It doesn't change often, and it wouldn't be that much effort, I'm just not willing to put it in.

It's not really about llama.cpp versus other frameworks. The issue of half-window scoring is about researchers being able to safely publish llama.cpp perplexity scores and call them "perplexity" scores and not "llama.cpp b2797 perplexity" scores. Because a) llama.cpp perplexity may or may not correlate well with standard last-token perplexity and b) llama.cpp won't document any deviations from standard evaluation, so you have to look up perplexity.cpp implementation at a particular hash number in order to understand what is being reported.

If I may ask, what is your interest in llama.cpp perplexity scores?

I don't care whether or not researchers can use the perplexity binary. My concern is almost exclusively to measure the precision loss from performance optimizations that trade precision for speed and/or reduced memory use, such as quantization within llama.cpp and with no relation to other frameworks.

So you found that llama 2 has 2.93x more tokens than llama 3?

No, what I meant is that LLaMA 2 for the same text needs 1.16x more tokens.

How did you get your numbers above?

I just compared the number of chunks when feeding Wikitext-2 test to perplexity. LLaMA 2 gives you 655 chunks, LLaMA 3 gives you 564 chunks. If you neglect the error from the last, partial chunk that gives you a ratio of 1.16.

Also, digging into your README, why do you say "Note that perplexity is not directly comparable between models, especially if they use different tokenizers."? I don't know if that is true in general. If the statement were rewritten to the stronger: "Perplexity is directly comparable between models unless they use different tokenizers." would you agree or disagree?

No, because finetunes use the same tokenizers as base models and other finetunes and are not at all comparable via perplexity.

turian commented 4 months ago

It doesn't change often, and it wouldn't be that much effort, I'm just not willing to put it in.

Fair enough.

I don't care whether or not researchers can use the perplexity binary. My concern is almost exclusively to measure the precision loss from performance optimizations that trade precision for speed and/or reduced memory use, such as quantization within llama.cpp and with no relation to other frameworks.

Okay, understood. I'll try to reply with this worldview in mind; I'm just someone interested in using llama.cpp for empirical research. So we have different needs.

No, what I meant is that LLaMA 2 for the same text needs 1.16x more tokens.

Okay. Got it now.

Can you explain how you computed 3.4x more mistakes per token with llama 3 q2_K? In your table the mean delta p is reported as -9.123 ± 0.051 %.

No, because finetunes use the same tokenizers as base models and other finetunes and are not at all comparable via perplexity.

Why not?

JohannesGaessler commented 4 months ago

Can you explain how you computed 3.4x more mistakes per token with llama 3 q2_K? In your table the mean delta p is reported as -9.123 ± 0.051 %.

That number is computed by measuring the probabilities that the FP16 and q2_K models assign to the "correct" token with 1.0 temperature and no other samplers and simply averaging the difference. So in essence LLaMA 2 q2_K causes 2.7% more incorrect guesses relative to FP16 while it's 9.1% for LLaMA 3. What I should have said is that the additional mistakes per token due to quantization are 3.4x higher, not the absolute number of mistakes.
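
If I follow, a minimal sketch of that metric would be something like the following (the names and data layout are hypothetical, not the actual llama.cpp code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean Δp: average difference between the probability the quantized model and
// the FP16 model assign to the observed ("correct") next token, at temperature
// 1.0 with no other samplers. A value of about -9.1% means the quantized model
// assigns, on average, 9.1 percentage points less probability to the correct token.
static double mean_delta_p(const std::vector<double> & p_correct_fp16,
                           const std::vector<double> & p_correct_quant) {
    assert(p_correct_fp16.size() == p_correct_quant.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < p_correct_fp16.size(); ++i) {
        sum += p_correct_quant[i] - p_correct_fp16[i];
    }
    return sum / p_correct_fp16.size();
}

// The 3.4x figure is then the ratio of magnitudes across models:
// |Δp(LLaMA 3 q2_K)| / |Δp(LLaMA 2 q2_K)| ≈ 9.1 / 2.7 ≈ 3.4
```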

Why not?

Because perplexity over a text corpus measures how well a model can reproduce that specific type of text. So finetunes that are intended to produce a specific type of text are going to score better/worse depending on how that text type aligns with the text corpus.

turian commented 4 months ago

That number is computed by measuring the probabilities that the FP16 and q2_K models assign to the "correct" token with 1.0 temperature and no other samplers and simply averaging the difference. So in essence LLaMA 2 q2_K causes 2.7% more incorrect guesses relative to FP16 while it's 9.1% for LLaMA 3. What I should have said is that the additional mistakes per token due to quantization are 3.4x higher, not the absolute number of mistakes.

Okay. Thank you for explaining. Have you used this evaluation measure of relative errors before or is it the first time? I defer to your expertise in these evaluations of quantization impact. Is 3.4x relatively higher mistakes actually important? I mean it sounds like it but I haven't thought about it enough.

Because perplexity over a text corpus measures how well a model can reproduce that specific type of text. So finetunes that are intended to produce a specific type of text are going to score better/worse depending on how that text type aligns with the text corpus.

I don't understand this.

When I make this strong statement: "Perplexity is directly comparable between models unless they use different tokenizers." I mean that using perplexity on different corpora, I can directly compare perplexity scores of BLIND models and make valid high-precision statements about their capabilities.

I guess I should be careful to say that I mean "perplexity on a particular corpus", not treating "perplexity" as if there's only one value of it.

| | General Wikipedia | Biomedical Articles | Legal Documents |
| -- | -- | -- | -- |
| Model 1 | 25.0 | 60.0 | 65.0 |
| Model 2 | 22.0 | 55.0 | 60.0 |
| Model 3 | 45.0 | 40.0 | 35.0 |

So I can conclude Model 2 is better than Model 1 in general and on biomed and legal (lower perplexity is better), and conclude Model 3 is better on biomed and legal than Models 1 and 2.

Then I unblind the models, and Model 3 was indeed Model 2 or 1 finetuned on biomed and legal. So I have been able to draw reasonably high-precision conclusions from blinded models just by directly comparing their perplexities (and knowing they have the same tokenizer).

If you do know about the models, you can make statements about the variables not controlled for. For example, if they have the same tokenization and architecture but different training data or parameter count, then you can use perplexity to say: training data AND/OR parameter count makes this model worse, as evidenced by higher perplexity than that model.

So I still in this case take a stronger position than you: You can compare per-token perplexity scores between ANY two models with the same tokenizer, and the conclusions you draw from those comparisons will be high-precision. But I invite any non-pathological counterexamples.

Thus, in my view, if you make the perplexity measure per-byte or per-character, I would take the even stronger position that you can now compare the per-character/per-byte perplexity scores between ANY two models, and the conclusions you draw from those direct comparisons will be high-precision.

turian commented 4 months ago

The reason I'm interested in perplexity is that, if you believe my argument, then by choosing different corpora carefully you can use it to understand the capabilities of different models.

This is important because constructing good evaluation sets using labeling for a particular domain can be very expensive / hard. But unsupervised corpus collection for model evaluation and leaderboards is much much easier.

JohannesGaessler commented 4 months ago

Have you used this evaluation measure of relative errors before or is it the first time?

It is the first time and I only added this specific metric in the PR where I also added the tables and updated the README.

Is 3.4x relatively higher mistakes actually important?

The measurement should definitely be sufficiently precise but I did not quantify any systematic uncertainties.

I mean that using perplexity on different corpora, I can directly compare perplexity scores of BLIND models and make valid high-precision statements about their capabilities.

How are you going to separate style from content? One could feasibly train a model that gets facts wrong but writes in the style of academic journals and also a model that has terrible spelling and grammar but gets facts right. While these two cases are of course not likely to actually happen I do intuitively believe that style is going to have more of an effect on perplexity than the actual model quality as it would be rated by humans.

turian commented 4 months ago

It is the first time and I only added this specific metric in the PR where I also added the tables and updated the README.

Okay. In this respect I am a bit more conservative. If I'm using a new measure, I trust the directionality of the scores but not necessarily their magnitudes, until I've played with it on different data for a bit.

The measurement should definitely be sufficiently precise but I did not quantify any systematic uncertainties.

I guess I'm more agnostic about whether 3.4x is catastrophically worse or just kinda worse. It can be hard to lock in what a new score means for qualitative judgment.

How are you going to separate style from content? One could feasibly train a model that gets facts wrong but writes in the style of academic journals and also a model that has terrible spelling and grammar but gets facts right. While these two cases are of course not likely to actually happen I do intuitively believe that style is going to have more of an effect on perplexity than the actual model quality as it would be rated by humans.

You are talking about evaluation of text generation; I am more interested in evaluation of text understanding, since I think good understanding is the current frontier that the state of the art is pushing, while good styling is more of an engineering problem at this point.

Within the context of text generation, yeah style evaluation prevails. That's why people criticize the lmsys leaderboard for giving the highest scores to models that supplicate to you at the level perfectly attuned to the average tester, rather than being right or in-depth or evidence-based. I guess that's why I'm interested in measures like perplexity where style isn't really that important.

With that said, I still think perplexity is cool for learning about models (understanding AND generation) even in the presence of stylistic variation or bad spelling/grammar. What's cool about perplexity is that you can propose a new factor of language modeling to evaluate, and someone can construct a style-evaluation or spelling/grammar-evaluation corpus pretty easily.

i.e. I believe I could construct evaluation corpora that, using perplexity, could tell you various things about blind models.

So my argument is also about making reasonable inferences about models, by quickly constructing new datasets and evaluating and comparing model perplexity on them.

turian commented 4 months ago

One more thing that occurred to me, which maybe explains your feeling that you can't compare perplexity between different models: if we were talking mainly about wikitext perplexities, then yeah, I can see how I would sour on using perplexity for comparing different models after seeing amazing models score worse than poor models that happened to have wiki in their training data. That's why my interest is in computing perplexity on held-out corpora for model comparison.

JohannesGaessler commented 4 months ago

@turian what do you think of comparing the rate of incorrect token predictions per text? I'm thinking, regardless of which tokenizer you use, once you sample a single incorrect token the whole sequence is going to diverge. And in the end what matters aren't the individual tokens anyways but the resulting text. So I'm thinking this could be used as a high-precision metric to compare any two models for a given text corpus.

JohannesGaessler commented 4 months ago

Actually, now that I think about how to combine the incorrect tokens into a single number at the end I would be using the negative log-likelihood at which point you would be back at perplexity per token.

turian commented 4 months ago

@JohannesGaessler Right. This comes back to the fact that the bigger your vocabulary, the more choices you have at every step. (And the fewer number of steps you have.) Which is why I think there should be a push for a uniform measure.

You see that this stuff is quite difficult to reason about and requires considerable thought. That's why a push for something less foot-gunny is good for people of different experience levels.

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

Saibo-creator commented 2 months ago

If we want to switch to something standard, it would be bpc. But I do think it would be interesting to do per-character/per-byte perplexity, and see also how the half-window versus per-token scores compare for a variety of context sizes.

Thanks for opening this thread and sharing insights, @turian! Regarding per-character vs per-byte perplexity: wouldn't per-character perplexity be better, because per-byte would depend on the Unicode encoding scheme (e.g. UTF-8 vs UTF-16), which intrinsically encodes characters using different numbers of bytes?

If we go with per-byte perplexity, then if one day a model trains its tokenizer on UTF-16, its scores will become incomparable.
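
A tiny sketch of the concern (assuming a UTF-8-encoded source file; the counts apply only to this example string):

```cpp
#include <cstdio>
#include <cstring>

int main() {
    // "héllo 你好" is 8 characters (code points), but its byte count depends on
    // the encoding: 13 bytes in UTF-8 (é is 2 bytes, 你 and 好 are 3 bytes each),
    // while UTF-16 would need 16 bytes for the same 8 characters.
    const char * s = "héllo 你好";
    printf("UTF-8 bytes: %zu, characters: 8\n", strlen(s));
    return 0;
}
```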

turian commented 2 months ago

@Saibo-creator Agreed that there is a good argument for per-character vs per-byte. I think the argument would fundamentally come down to which is a truer surrogate for pure data compression.