cyrilou242 / ftcc

Fast Text Classification with Compressors dictionary
MIT License
146 stars 10 forks

Performance in few-shot setup #2

Closed: flipz357 closed this issue 1 year ago

flipz357 commented 1 year ago

Awesome work @cyrilou242, thanks!

I wonder if anyone has tested the ftcc method against gzip in the few-shot setup, k={5, 25,...}, I thought this was like the highlight of the gzip-paper, showing that gzip can perform reasonably with small training data. I guess if training data is sufficient, there's hardly a way to beat BERT with any non-neural method.
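For reference, here is a minimal sketch of the kind of gzip-based kNN under discussion. It assumes the usual normalized compression distance (NCD), with the two texts joined by a space before compression. The names `clen`, `ncd`, and `knn_predict` are illustrative only and are not taken from ftcc or from the gzip paper's code.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings (assumed formula)."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)  # compress the concatenation of both texts
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_predict(query, train, k=1):
    """train is a list of (text, label) pairs; returns the majority label
    among the k training texts closest to the query."""
    nearest = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In a k-shot run, `train` would hold k labelled examples per class. Each prediction compresses the query against every training text, so the cost grows linearly with the size of the training set.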

cyrilou242 commented 1 year ago

> the ftcc method against gzip in the few-shot setup, k={5, 25,...}, I thought this was like the highlight of the gzip-paper

I have not tested yet. I think the result is artificial and not that interesting.

I can give it a try though. Maybe this weekend.

flipz357 commented 1 year ago

Thanks!

> I have not tested yet. I think the result is artificial and not that interesting.

Why do you think it is not interesting? I'd even say that the underlying KNN doesn't really make sense if there is a lot of training data (it is too costly). Even if accuracy results may be low, it all depends on whether it is better than the baselines.

Maybe it's of interest to you: I ran some experiments with bag-of-words and gzip on some data sets, with different numbers of shots. Sometimes the accuracy results are not even that bad (like on DBPedia). These data sets also don't have the issues of data contamination (afaik).

image

Image is from this write-up.

Again, ftcc is very cool and interesting work!

cyrilou242 commented 1 year ago

> Why do you think it is not interesting?

I'm an industry guy, I may be biased.
Contrary to some cases for which labelled data is hard to obtain, labelled text classification data is pretty easy to obtain. So let's say my goal is to have a good model with a given budget. I start with a dataset of 5 observations. I have two choices: spend money on improving the model or spend money on improving the data. It is pretty obvious at 5 observations that any dataset improvement will improve the system orders of magnitude more than improving the modelling. Also, a dataset improvement is often a better investment than a model improvement: labelled data tends to live longer than a particular model.

I can see a use case for cold-start in labelling UI like https://prodi.gy/ though.

Anyway, given you have a nice paper with nice numbers, I can give it a try. Do you have the code with a random seed or an exact list of the samples taken for the few-shots?
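To make the request concrete, here is a minimal sketch of a reproducible few-shot draw: k examples per class, selected with a fixed seed so the exact sample list can be regenerated. The function name `sample_few_shot` and the `(text, label)` pair format are assumptions for illustration, not code from either project.

```python
import random
from collections import defaultdict


def sample_few_shot(dataset, k, seed=0):
    """Pick k examples per label from a list of (text, label) pairs, reproducibly."""
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append((text, label))
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    subset = []
    for label in sorted(by_label):  # fixed label order keeps the draw deterministic
        # rng.sample raises ValueError if a class has fewer than k examples
        subset.extend(rng.sample(by_label[label], k))
    return subset
```

Publishing the seed together with the dataset order (or simply the returned list) would let someone else rerun the same k-shot split.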

flipz357 commented 1 year ago

> It is pretty obvious at 5 observations that any dataset improvements will improve the system order of magnitudes more than improving modelling.

That's true, now I get where you're coming from!

But, I mean, doesn't this then beg the question, what is even the purpose of using a KNN for text classification (regardless of which distance function you take)?

> I can see a use case for cold-start in labelling UI like https://prodi.gy/ though.

Interesting, I'll check this out.

> Do you have the code with a random seed or an exact list of the samples taken for the few-shots?

Unfortunately not, my code is just changing the distance function in the gzip paper (and fixing the evaluation).

flipz357 commented 1 year ago

btw, I have now tested an SVM + TF-IDF bag-of-words approach. It performs very strongly on the full training data (almost the same level as BERT), and it is also strong in the few-shot setup:

Here's the preliminary table (I'm also gonna update my note with it, soon):

image

(SIMPLE in the image above has become BowDist)

So I am starting to wonder a bit why the gzip paper was hyped so much in the first place.

flipz357 commented 1 year ago

I updated my paper with more results now.

Seems as if BoW approaches are very strong. When used as an "untrained" distance measure in KNN, BoW performs better than the GZIP distance in KNN; when used inside a trained classifier, it achieves almost the same performance as BERT.

It'd still be interesting for me to see how ftcc relates to all of this (maybe it can be used as a feature in a BoW classifier?), and also whether it works with less training data, but I'm gonna close this issue for now.
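As an illustration of what an "untrained" bag-of-words distance inside KNN can look like, here is a minimal sketch. It uses the Jaccard distance on token sets as a stand-in; this is an assumption made for the example and not necessarily the distance used in the paper. The names `bow`, `jaccard_distance`, and `knn_predict_bow` are hypothetical.

```python
from collections import Counter


def bow(text):
    """Lowercased set of whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| on the two token sets; 0.0 when both are empty."""
    sa, sb = bow(a), bow(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def knn_predict_bow(query, train, k=1):
    """train is a list of (text, label) pairs; majority vote over the k nearest."""
    nearest = sorted(train, key=lambda pair: jaccard_distance(query, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Nothing is fitted here: as with the gzip distance, the only "model" is the labelled training set itself, so the two distances can be swapped inside the same KNN loop for comparison.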