Jellyfish042 / uncheatable_eval

Evaluating LLMs with Dynamic Data
MIT License

Potential skew due to different tokenizers or vocabularies #1

Closed · melang982 closed this issue 7 months ago

melang982 commented 7 months ago

It calculates based on tokens, right? But if models use different tokenizers or vocabularies, wouldn't that skew the results? I really like the idea, by the way; I just thought it might be better to compare by character or word.

Jellyfish042 commented 7 months ago

We calculate the sum of negative log probabilities, so we're evaluating the ability of the entire system (both the model and the tokenizer) to model a piece of text. For a given piece of text, the model with the lower sum of negative log probabilities is more likely to generate it, which means it's better.

In other words, we don't need to concern ourselves with the type of tokenizer or the vocabulary used. The system that is more likely to generate a piece of real text is considered better.
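To make the metric concrete, here is a minimal sketch of how one might compute the summed negative log probability of a text with a Hugging Face causal LM. This is only an illustration, not the repository's actual code, and the model name is just a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def neg_log_prob_sum(text: str, model, tokenizer) -> float:
    """Sum of -ln p(token_i | preceding tokens) over the whole text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # (1, seq_len, vocab_size)
    # log-probabilities the model assigns at each position for the next token
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    # pick out the log-probability of the token that actually came next
    actual_next = ids[:, 1:].unsqueeze(-1)
    token_log_probs = log_probs.gather(-1, actual_next).squeeze(-1)
    return float(-token_log_probs.sum())

# Example usage (any causal LM works; "gpt2" is just an example)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(neg_log_prob_sum("The quick brown fox jumps over the lazy dog.", model, tokenizer))
```

Because the sum is taken over whatever tokens the system's own tokenizer produces, systems with different tokenizers can still be compared on the same raw text: the lower total wins.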

melang982 commented 7 months ago

Please correct me if I'm wrong: Imagine the ground truth is 一, and the model outputs 不.

Model A treats each byte as a separate token (like GPT-4 does before applying merges). The first two bytes are the same:

一 → 0xe4 0xb8 0x80
不 → 0xe4 0xb8 0x8d

The first two bytes, 0xe4 0xb8, are shared by 64 very common Chinese characters.

For example, the model predicts 0.95, 0.9, and 0.6 for those three bytes; the sum of negative log probabilities is 0.28977, divided by 3 tokens = 0.0966.

Model B treats each Chinese character as a token; it predicts 0.6, so the negative log probability is 0.2218. We would then conclude that Model B is much worse, even though both models are equally wrong.

(This applies to English as well, e.g. if one model splits "cats" into "cat" and "s" and outputs "dog" + "s", while the other has "cats" as a single token.)

Jellyfish042 commented 7 months ago

First, when calculating the logarithms we use the natural logarithm, without applying any averaging or normalization. For the example you provided, the result for Model A would be: -math.log(0.95) - math.log(0.9) - math.log(0.6) = 0.667, while for Model B it would be: -math.log(0.6) = 0.511 (where the numbers are the probabilities at the ground-truth positions in the model's output). Therefore, in this scenario, Model B is considered better than Model A. This aligns with the facts, as Model B indeed has a higher probability of generating the specified text. I hope this clarifies things for you.
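For reference, the same arithmetic as a runnable snippet (probabilities taken from the example above):

```python
import math

# Model A: three byte-level tokens, with probabilities 0.95, 0.9, 0.6 at the ground-truth bytes
model_a = -math.log(0.95) - math.log(0.9) - math.log(0.6)  # ≈ 0.667
# Model B: one character-level token, with probability 0.6 at the ground-truth character
model_b = -math.log(0.6)                                   # ≈ 0.511
print(model_a, model_b)  # Model B has the lower sum, so it is scored as better here
```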

melang982 commented 7 months ago

Isn't it doing averaging here? `'neg_log_prob_sum': sum(data) / len(data)`, `'neg_log_prob_sum': sum(rwkv_test_data) / len(rwkv_test_data)` (from uncheatable_eval.ipynb)

Jellyfish042 commented 7 months ago

This is the average of results across different samples, where each sample is a piece of text. For example, the test results given in the README are the average values across 1000 samples.
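As a hypothetical illustration of that averaging step (the values below are placeholders, not real results): each entry is the summed negative log probability of one sample text, and the reported figure is simply their mean.

```python
# One summed negative log probability per sample text (placeholder values)
per_sample_sums = [812.4, 790.1, 805.7]
neg_log_prob_sum = sum(per_sample_sums) / len(per_sample_sums)
print(neg_log_prob_sum)  # ≈ 802.73 -- an average over samples, not over tokens
```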

melang982 commented 7 months ago

Thanks for the explanation. I'm still not entirely sure whether `neg_log_prob_sum` can be used across different tokenizers, but I understand it better now.