bazingagin / npc_gzip

Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
MIT License

Problem with accuracy calculation? #3

Open kts opened 1 year ago

kts commented 1 year ago

https://github.com/bazingagin/npc_gzip/blob/a46991564161023bba3b1267e0e74c69dab8f8eb/experiments.py#L116

It appears that in the calc_acc method it marks a sample correct if ANY of the labels with the same most_count have the correct label.

For k=2 (I think the paper always used k=2?), any time there is a 1-1 tie (top 2 have different labels), it will be correct if EITHER is correct, rather than a "fair" classifier that would have to make a decision before scoring.
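
For concreteness, here is a minimal sketch (not the repository's actual code) of the scoring rule being described: a sample counts as correct if any of the labels tied for the highest vote count matches the ground truth. Function name and inputs are illustrative.

```python
from collections import Counter

def scored_correct_optimistic(neighbor_labels, true_label):
    """Score a sample as correct if ANY of the labels tied for the
    highest vote count equals the ground truth (for k=2, a 1-1 tie
    is correct when EITHER neighbor carries the true label)."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    return true_label in tied

print(scored_correct_optimistic(["sports", "tech"], "tech"))    # True: 1-1 tie, truth among tied labels
print(scored_correct_optimistic(["sports", "sports"], "tech"))  # False: clear 2-0 vote for a wrong label
```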

This fork has my implementation of the kNN accuracy results for comparison; see the file docstring for notes: https://github.com/kts/npc_gzip/blob/main/calc_acc.py

IlyaGusev commented 1 year ago

It seems to be the original intention of the authors (link):

(tweet screenshot)

The question is: do the authors report top-2 accuracy for all the other methods? I found no code for them in this repository, but I may have missed something. And it seems there is nothing about it in the paper.

Because if you report top-1 accuracy for one set of methods and top-2 accuracy for another set of methods, it is not fair at all.

However, the paper is still very insightful and well-written, even if the numbers are incorrect.

(Edited): You can also compare the results of other methods on popular datasets with other papers. Here is AGNews, for instance. Those results are similar to the paper's results and report top-1 accuracy, not top-2.

thtang commented 1 year ago

So, is this method useful?

daenuprobst commented 1 year ago

I cannot speak for the benchmarks run in the original manuscript, but my reimplementation, which reports numbers for a weighted kNN classifier and also regression (naive kNN regressor) for molecular string representations, works well given the simplicity of the approach (see https://github.com/daenuprobst/molzip). So the method is useful in that setting, but there is more to do and things to try.

bazingagin commented 1 year ago

Thank you @kts for opening the issue & Sorry for the late reply! re: your issue and blog: https://kenschutte.com/gzip-knn-paper/

When the flag rand=False, the logic of the code is to calculate the maximum accuracy when tied, so it does assume an ideal situation in which the guess is always right whenever there is a tie. This means:

  1. It's not a bug.
  2. The logic of the code doesn't report top k accuracy.
  3. This strategy is used for all non-parametric methods like W2V and SentenceBERT.

I want to elaborate on point 2. Let's first assume "top k accuracy"[1] here means "the percentage of samples for which the k nearest labels contain the ground truth". Point 2 is pretty clear when k=3: say we have y1="sports", y2="sports", y3="tech". The code snippet will predict "sports" and won't score if the ground truth is "tech", but top-3 accuracy will score even if the ground truth is "tech". So clearly, top-3 accuracy will be higher than what our code snippet reports, and this reasoning extrapolates easily to k>3. I hope at this step everyone is at least convinced that, logically, the code snippet doesn't report top k accuracy.

So what's the difference between k=3 and k=2, and why do you think the code reports top-2 accuracy? It's because, numerically, the score of "top-2 accuracy" and of "k=2 with the maximum strategy when tied" are the same under definition [1]. However, numerical equivalence doesn't mean logical equivalence. Besides, I don't think definition [1] makes sense for kNN, for the following reason. In general, "top-2 accuracy"[2] means that if either of the top-2 predicted classes is correct, we score. But in kNN with k=2, there is a situation where the 2 nearest neighbors point to the same class - we are missing another class candidate if we want to report top-2 accuracy. For a discriminative model, say BERT, the top-2 predicted classes can never be the same. For tasks like sentiment analysis with only "positive" and "negative" labels, this means any discriminative model's top-2 accuracy will be 100%, while for kNN it will mostly be less than that. We would be implicitly using different definitions of top k accuracy for kNN and for discriminative models. To the best of my knowledge, I'm not aware of definition [1] or any other specific definition of top k accuracy for kNN.

The reason I chose k=2 in the first place is that it's recommended to use $\sqrt{n}$ and I wanted to pick a k that's compatible with the 5-shot setting. That being said, I should have reported accuracy under different values of k and different tie strategies. I think k=2 with the maximum strategy is the upper bound accuracy we can get, but this doesn't make our result invalid or an unfair comparison with other methods, for the reasons I mentioned above. Please let me know your thoughts!

Also, I would love to add the "decrement" strategy to the repo and let people choose the strategy freely, if you'd like.
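
For reference, a hedged sketch of how the tie-breaking strategies discussed in this thread (maximum for the upper bound, minimum for the lower bound, random) could be implemented over a list of nearest-neighbor labels; the function name and signature are illustrative, not the repo's actual API, and the "decrement" strategy is not shown.

```python
import random
from collections import Counter

def predict_with_ties(neighbor_labels, true_label, strategy="random"):
    """Resolve a kNN vote. 'maximum' assumes every tie is resolved in favour
    of the ground truth (upper bound), 'minimum' assumes every tie is resolved
    against it (lower bound), 'random' picks uniformly among tied labels."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    if len(tied) == 1:
        return tied[0]
    if strategy == "maximum":
        return true_label if true_label in tied else tied[0]
    if strategy == "minimum":
        wrong = [label for label in tied if label != true_label]
        return wrong[0] if wrong else true_label
    return random.choice(tied)
```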

kts commented 1 year ago

Thanks for the reply. Sorry if this added extra stress to your thesis day!

Now, I understand that this is the desired behavior rather than a bug.

Why do I call it top-2 accuracy? In k=2 case, you find the two closest training points to your test point and look at their labels. Let's call the labels C for correct and W for wrong, depending on if it equals the test point's label or not. There are 4 possibilities,

labels   voting       output
---------------------------------
WW       W wins 2-0   W (the two Ws could be different, but the result is the same)
WC       1-1 tie      C
CW       1-1 tie      C
CC       C wins 2-0   C

So, it's equivalent to being correct if the correct label is one or both of the top 2. (I did try to mention in the blog that it's not computing top-k for k>2, because it's only looking at those TIED for highest count, not all k).
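
This claimed numerical equivalence for k=2 is easy to check with a small self-contained script (hypothetical helper, illustrative class names):

```python
import random
from collections import Counter

def optimistic_tie_score(labels, truth):
    """Correct if the truth is among the labels tied for the highest count."""
    counts = Counter(labels)
    top = max(counts.values())
    return truth in [label for label, count in counts.items() if count == top]

# For k=2, the optimistic tie rule scores exactly the cases where the
# true label appears among the two nearest labels ("top-2 containment").
classes = ["sports", "tech", "business"]
random.seed(0)
for _ in range(1000):
    pair = [random.choice(classes) for _ in range(2)]
    truth = random.choice(classes)
    assert optimistic_tie_score(pair, truth) == (truth in pair)
```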

The problem is with the 1-1 ties. You need some method to break the tie, and a "fair" classifier cannot look at the test label. I think if you use k=1, the code is fine (there are never ties), but the results will necessarily be lower (and I believe k=2 was used for the results in the paper).

You're right, I too have never seen "top-k accuracy" used in the context of kNN, so maybe the wording of "top-2" isn't very helpful and I shouldn't use that. It just seemed a concise way to describe the phenomenon described above. (Yes, it's a bit different here because we're talking about k points instead of k labels).

I'm not sure I followed everything you wrote, but maybe you agree with this and the main issue is just making that clearer in the paper. That's possible, but I'd say the main point is: this method uses the test label as part of its decision process, which is not the standard classification setting and can't be fairly compared to methods that don't.

flipz357 commented 1 year ago

Hi all,

thanks @bazingagin for the code and for making me think about gzip - such an intriguing way of measuring similarity!

About taking the correct label if predictions are tied.

This strategy is used for all non-parametric methods like W2V and SentenceBERT

This sounds fair, but I'm not 100% sure it really is, since the likelihood of ties may be greater for some methods than for others. Do you know how often ties happen for the different methods? I guess if it's about equal then it may be fair. Maybe it would be good to recalculate all methods and baselines with the random selection by @kts and take the mean over different seeds.

bazingagin commented 1 year ago

Hi @kts, thanks for your reply. No worries, I think it's my responsibility to clarify things : ) I'm glad we are on the same page about top k accuracy!

Regarding the tie-breaking strategy: I agree with you that the maximum strategy is too idealized to be practical. But I don't quite agree that the problem here is that the test label is used in the decision process. Code-wise, I'm calculating the accuracy on the fly together with making predictions, for simplicity. But I could totally separate these two processes by running the prediction multiple times on a single dataset to get every possible predicted label list, calculating the accuracy for each list, and then checking the minimum, maximum, and average accuracy over those lists. That's why I've always said my reported accuracy is an upper bound: in the above scenario, I only report the maximum accuracy among all the predicted label lists. Therefore, I still think the maximum strategy can serve as one of the tie-breaking options in the code, for the upper bound, but reporting it alone is not enough.

As you suggested, I will modify the paper to (1) make things clearer: originally I only mentioned "we report the maximum possible accuracy", and I will elaborate more on the details; and (2) report 1<=k<=5 for both random and maximum accuracy. I will also list the results in the repo.
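
The minimum and maximum accuracy described here can also be computed directly per sample, without enumerating every possible predicted label list. A hedged sketch with illustrative names:

```python
from collections import Counter

def accuracy_bounds(all_neighbor_labels, true_labels):
    """Lower bound: every tie is resolved against the ground truth.
    Upper bound: every tie is resolved in favour of the ground truth."""
    lower = upper = 0
    for labels, truth in zip(all_neighbor_labels, true_labels):
        counts = Counter(labels)
        top = max(counts.values())
        tied = [label for label, count in counts.items() if count == top]
        if truth in tied:
            upper += 1
            if len(tied) == 1:  # unambiguous vote, counts for both bounds
                lower += 1
    n = len(true_labels)
    return lower / n, upper / n
```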

bazingagin commented 1 year ago

Hi @flipz357,

Thanks for your comments. I agree with you that different methods will have different likelihoods of ties. So yes, I will report the accuracy with random selection.

kts commented 1 year ago

Okay, you're free to use whatever wording you see fit and think readers will understand. I didn't write the blog post or this issue to make you change the paper, etc., but for others trying to recreate the results.

But I do maintain that using this approach (either looking at the test label or running multiple times with randomness and taking the best) makes the comparisons with other baselines unfair - i.e. it's wrong to say "this beats BERT in these tests" - unless you get those results with a classifier that doesn't look at the answer.

A probably bigger problem with at least some of the datasets was pointed out by someone here: https://github.com/bazingagin/npc_gzip/issues/13

Anyways, it's still an interesting approach and paper. Cheers.

eric-xw commented 1 year ago

@bazingagin I was really excited about this paper, but I carefully checked the code and here I have to agree with @kts that the test labels are indeed used in the decision making process. The "maximum accuracy" you referred to is essentially computed with two steps for the WC (wrong-correct) tie cases: (1) choose the C label as the prediction using the ground-truth label (there is no other guaranteed way to choose a C label over a W label); (2) calculate the accuracy between the predictions and the ground truth labels.

So in a fair setting, there should never be a "maximum accuracy" (it is itself unfair unless other methods also report exactly the same maximum accuracy); there should be one and only one accuracy, computed between the predictions (each data sample should have only one final prediction) and the test labels. The process of obtaining the predictions should be independent of the test labels, for example via majority voting or random selection.

Hope this helps. I think it is still a good paper after the fix and result update.

bazingagin commented 1 year ago

hey @kts @eric-xw, I understand that you think using the max accuracy of kNN for comparison is unfair. What I was trying to say is that a random guess has its lower bound and upper bound, and theoretically the upper bound is achievable (yes, the probability is small - if we have, say, 10 tied instances, there is only a (1/2)^10 probability of achieving the highest - but it is still achievable in theory). But I guess we don't need to seek common ground on that, as the discussion of fair versus unfair comparison/evaluation can go on and on. For example, as https://twitter.com/drjwrae/status/1679988179179536385?s=21 points out, comparing GPT&knn with gzip&knn is a fairer comparison, as we can directly compare the estimated compressed length (negative log-likelihood) of GPT with the compressed length from gzip, which can better indicate whether pretrained models are worse at capturing regularity on OOD data (they are, according to his experiment in the random case). I think these are all great suggestions and should be taken into consideration for future work (using weighted kNN might be even better).

But here, I only want to make sure I disclose every detail of the experiment - people are aware the reported result uses max accuracy, assuming every guess is correct when tied. It's their judgment whether they want to look at the maximum accuracy in order to see the full potential of this similarity measurement, or use random tie-breaking for a practical classifier.

Thanks for the constructive discussion and I will add random results anyway.

cyrilou242 commented 1 year ago

Hey, thanks for the nice paper. My understanding is that @bazingagin has an information-theory standpoint, for which measuring the amount of information returned by the 2-NN is relevant. Hence the "maximum" accuracy, with accuracy being a proxy for the information captured. It seems, as mentioned by @bazingagin, that other methods may be more relevant for measuring this "information capture".

@kts @eric-xw focus on the end-to-end task and experimental setup, and point out that the accuracy numbers would not be reproducible in a real setup, because the label would not be available. The comparison becomes unfair to the other models in Table 3 that don't have this label info to exploit.

I will add random results anyway.

I think it'd be more helpful if you'd provide:

  1. the minimum and maximum accuracy (no need for random then, random being - on average - the average of the two) - it's easier to reproduce
  2. the accuracy with k=1 - again easy to reproduce and corresponds to a "practical classifier"

flipz357 commented 1 year ago

@cyrilou242

The comparison becomes unfair to other models in the table 3 that don't have this label info to exploit.

imo it may be unfair even among the approaches that can exploit label info, since the approaches may have different likelihoods of ties (and an approach with more ties has an advantage). But yes, if this is the case it may be seen from your suggestion to report

the minimum and maximum accuracy (no need for random then, random being - on average - the average of the two)

kts commented 1 year ago

Okay, last comment on this :)

Why report the max rather than the mean?

A "fair" classifier can certainly use randomness and you may want to run it multiple times to get a better estimate of it's accuracy - but, for that use the mean over the trials.

I can define a classifier that ignores the data and randomly picks a class. It has a max possible accuracy of 100%.

bazingagin commented 1 year ago

Why report the max rather than the mean?

Because reporting the mean either induces more stochasticity or requires an order of magnitude more experiments. As you see, we have few-shot experiments across all the settings, and each shot setting already requires 5 repeated experiments to draw different training samples. Using random tie-breaking and reporting the mean means either inducing more stochasticity, which may make the result hard to reproduce, or running at least 5*5 experiments for each shot setting.

I can define a classifier that ignores the data and randomly picks a class. It's has a max possible accuracy of 100%.

That's not the case here, as all datasets have more than 2 classes, so even randomly picking between the top 2 nearest neighbors when tied still requires a reasonable similarity measurement.

bazingagin commented 1 year ago

thanks @cyrilou242, what you said makes sense. I will also include minimum accuracy and k=1 case.

cyrilou242 commented 1 year ago

thanks @bazingagin!

twjang commented 1 year ago

I'm not an expert in this topic, so I have no idea about the standard way to measure accuracies. But can we use distance information to break ties? To be specific, we calculate the sum of distances of the k nearest points, and when a tie occurs, we choose the class whose sum of distances is the smallest.
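
A minimal sketch of that suggestion (illustrative names, assuming the k nearest neighbors are given as (distance, label) pairs):

```python
from collections import Counter, defaultdict

def knn_predict_distance_tiebreak(neighbors):
    """neighbors: list of (distance, label) for the k nearest points.
    Majority vote; ties are broken by the smallest summed distance
    among the tied classes."""
    counts = Counter(label for _, label in neighbors)
    top = max(counts.values())
    tied = {label for label, count in counts.items() if count == top}
    dist_sum = defaultdict(float)
    for distance, label in neighbors:
        if label in tied:
            dist_sum[label] += distance
    return min(dist_sum, key=dist_sum.get)
```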

flipz357 commented 1 year ago

@twjang Yes, but I guess an issue is that gzip will produce more ties, since its distance function is coarse-grained integer (afaik), while for sentence embeddings it is a fine-grained float.

kts commented 1 year ago

@twjang Yes, this is a reasonable strategy. For k=2 (used in the paper), the only ties are 1-1, so that strategy is equivalent to k=1 (just take the nearest neighbor).

@flipz357 The actual distances used are floats (NCD): https://github.com/bazingagin/npc_gzip/blob/3c219944acfbe4e80d83166eea698ff9da3321af/utils.py#L6. But they are derived from 3 integers (compressed string lengths), so you're right, it may be more susceptible to ties, depending on the dataset.
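
For readers following along, a sketch of the standard NCD computation from three gzip-compressed lengths (the exact concatenation convention here is an assumption; the repo's utils.py may differ):

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    A float, but derived from integer compressed lengths, so distinct
    pairs can easily collide on the same value."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))  # concatenation convention is an assumption
    return (cxy - min(cx, cy)) / max(cx, cy)
```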

bazingagin commented 1 year ago

@cyrilou242 here are results with both the lower bound and the upper bound. @kts I also include random (only run once, though) for reference.

(screenshot: results table with lower bound, upper bound, and random accuracy)

@flipz357 As you mentioned it may be unfair to other non-parametric methods, here are the results with random strategy using W2V and SentBERT for comparison.

(screenshot: results with the random strategy for gzip, W2V, and SentBERT)

flipz357 commented 1 year ago

Thanks @bazingagin !

Btw, actually you don't need to run random: the expected accuracy (using a random guesser to break ties) can simply be calculated. For K=2, as @cyrilou242 mentions, it boils down to the average of the upper and lower bounds. For arbitrary K the accuracy will depend on the number of options in each tie, but it can also be calculated.
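
A hedged sketch of that analytic calculation (illustrative names, assuming ties are broken uniformly at random among the tied labels): a non-tied correct vote contributes 1, and a tie whose m top labels include the ground truth contributes 1/m.

```python
from collections import Counter

def expected_random_accuracy(all_neighbor_labels, true_labels):
    """Expected accuracy of a uniform random tie-breaker, computed
    without sampling. For k=2 this reduces to the average of the
    lower and upper bounds."""
    total = 0.0
    for labels, truth in zip(all_neighbor_labels, true_labels):
        counts = Counter(labels)
        top = max(counts.values())
        tied = [label for label, count in counts.items() if count == top]
        if truth in tied:
            total += 1.0 / len(tied)
    return total / len(true_labels)
```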

maxmax1992 commented 1 year ago

@bazingagin, so in other words, if the top-2 closest points have different labels and one of them is the correct label, you count this as a correct prediction? This doesn't make sense. It's the same as saying: yeah, I'll assume that I'm very lucky, and even if the probability of picking the correct label is 50%, I will pick the correct one. I see your point that it's not top-2 accuracy, but it's a very unfair heuristic to use to get the predicted labels.

bazingagin commented 1 year ago

hey @maxmax1992, sure, I do agree it makes the accuracy of gzip inflated. So please check the updated version of the table (screenshot: updated_gzip).

The table also includes the updated result for the correct (non-overlapping) DengueFilipino dataset. Considering only the random strategy, we can see:

  1. In the few-shot settings, gzip(rand) outperforms all methods on Kinyarwanda, Kirundi, Filipino, and SogouNews but is worse than mBERT on Swahili, on which mBERT has been pretrained.
  2. In the full dataset setting, gzip(rand) is highest only on SogouNews, is competitive on Kinyarwanda and Swahili, and performs poorly on Kirundi and the updated DengueFilipino.

flipz357 commented 1 year ago

Overall I can confirm the updated numbers that @bazingagin posted. For 5-shot only roughly, but that is due to the large deviation over runs (as is also evident from the large 95% confidence intervals in the table).

I do think gzip is quite interesting, and its performance in the few-shot setting shows that advanced methods struggle in low-resource scenarios. However, it should also be noted that even simpler methods can provide good results; for anyone interested: https://arxiv.org/abs/2307.15002