MaxHalford / maxhalford.github.io

Personal website: https://maxhalford.github.io

blog/text-classification-by-compression/ #14

utterances-bot commented 3 years ago

Text classification by data compression - Max Halford

Last night I felt like reading Artificial Intelligence: A Modern Approach. I stumbled on something fun in the natural language processing chapter, in a section about classifying text. The idea of one particular subsection was to classify documents with a compression algorithm. This is such a left-field idea, and yet it makes sense when you think about it. To quote the book:

https://maxhalford.github.io/blog/text-classification-by-compression/
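As a minimal sketch of the idea under discussion (with zlib standing in for whichever compressor the post uses, and made-up labels and corpora; the post works through the real details):

```python
import zlib

def classify(doc: str, corpora: dict[str, str]) -> str:
    """Assign doc to the label whose training text it compresses best against."""
    def extra_bytes(label: str) -> int:
        corpus = corpora[label].encode()
        # If doc shares patterns with the corpus, appending it adds few bytes.
        return len(zlib.compress(corpus + doc.encode())) - len(zlib.compress(corpus))
    return min(corpora, key=extra_bytes)

corpora = {
    "sports": "the match ended in a draw after extra time and penalties",
    "cooking": "simmer the sauce gently and season the dish to taste",
}
print(classify("the referee added extra time before the penalties", corpora))
```

Note that each call recompresses every label's training text, which is the cost the first comment below addresses.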

eternaleye commented 3 years ago

So, one minor correction - you say this:

Moreover, it’s expensive because the training texts of each label have to be recompressed for each test document.

However, this isn't quite the case! Since you never need to run decompression, you can get away with retaining only the compressor state as of the end of the training data. Some implementations already support copying that state, letting you keep it as a "gold master" (relatively small compared to the compressed data) that you clone for each candidate document. This avoids recompressing the training data each time.

Some compressors have even turned this into a full feature in its own right - zstd has a --train operation, which consumes a corpus of training documents (it does the concatenation for you) and produces a dictionary you can feed to later invocations.
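A rough sketch of the same workflow from Python, assuming the third-party zstandard package (not the standard library) and a placeholder corpus:

```python
import zstandard  # third-party: pip install zstandard

# Placeholder samples; dictionary training wants a reasonably large corpus.
samples = [
    f"label {i % 4}: training text with shared phrases and boilerplate {i}".encode()
    for i in range(2000)
]

# The Python counterpart of `zstd --train`: distill the samples into a dictionary.
dict_data = zstandard.train_dictionary(1024, samples)

# Later compressions are primed with the dictionary.
cctx = zstandard.ZstdCompressor(dict_data=dict_data)
print(len(cctx.compress(b"label 2: a new document that shares the boilerplate")))
```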

MaxHalford commented 3 years ago

Hello @eternaleye, thanks for the comment, it's always nice to learn from strangers. I'll look into zstd.

Do you see any way to make this work with Python's standard library? In the case of lzma, I've tried to use the LZMACompressor class, but to no avail.

eternaleye commented 3 years ago

Sadly, no - many implementations don't expose the compressor state as a value, let alone allow cloning it; in addition, I don't do much work in Python to begin with, so I'm not familiar with the specific libraries. There's also the complication that you'd have to ensure that the compressor has emitted everything it can before being "frozen" as the gold master, or else the un-emitted output of the training dataset may confuse the results.

Mainly, I wanted to point out that this is at worst an implementation limitation, rather than an algorithmic one.
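As a side note on the standard-library question: while lzma doesn't expose the compressor state, CPython's zlib does. Its compressobj objects support copy() and incremental flushing, which is enough for a minimal sketch of the gold-master pattern described above:

```python
import zlib

TRAINING = b"...the concatenated training texts for one label..."

# Prime the "gold master" and force out pending output, so later
# measurements only reflect the candidate document (the freezing caveat above).
master = zlib.compressobj(9)
master.compress(TRAINING)
master.flush(zlib.Z_SYNC_FLUSH)

def added_cost(doc: bytes) -> int:
    clone = master.copy()  # cheap copy of the compressor state
    return len(clone.compress(doc)) + len(clone.flush())
```

Scoring a document then costs one copy() and one small compress per label, rather than recompressing each label's training texts.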

eamonnkeogh commented 2 years ago

Nice, we did exactly this in [a], using text from various translations of the first fifty chapters of the Bible, text from various Yahoo portals (in different languages), and even DNA!

[a] https://www.cs.ucr.edu/~eamonn/SIGKDD_2004_long.pdf

MaxHalford commented 2 years ago

That's really wonderful @eamonnkeogh, thanks a lot for sharing! If I may ask, how did you stumble on this blog post? :)

cyrilou242 commented 11 months ago

Better late than never: to follow up on @eternaleye's point, here is an implementation of the idea of reusing a compressor state with zstd: https://github.com/cyrilou242/ftcc

MaxHalford commented 11 months ago

That's so cool @cyrilou242, thanks for following up! So what does it look like in terms of numbers? How long do learning from a single sample and making a single inference take? I'd love to add this to River if the figures are decent 😁

cyrilou242 commented 11 months ago

So what does it look like in terms of numbers?

I have numbers from running on 20_News with all categories; it should be straightforward to add your case of only using 4 categories, and I'll try to do it this weekend. With all categories: accuracy = 0.79. On my machine, train takes 0.4 seconds (most of the time is spent on Python string concatenation) and predict has a p90 of 28.908 ms (it's linear in the number of categories, so it should be ~5x faster with only 4 categories).

How long do learning from a single sample and making a single inference take?

I haven't looked at online updates yet, but it should be possible. There are 2 main approaches: without dictionary compression (all the training data in memory, same as you do, model size ≈ training data) and with dictionary compression (zstd --train, model size < training data). A naive online update without dictionary compression, using 2x dataset_size memory, is straightforward to implement; I'm not sure about 1x dataset_size memory consumption. Online update with dictionary compression is possible in theory, but I don't know yet whether it's straightforward.
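For the no-dictionary case, that naive online update is essentially just an append. A sketch of what it could look like, with River-style learn_one/predict_one names (my assumption) and zlib standing in for zstd:

```python
import zlib

class CompressionClassifier:
    """Naive online variant: each class keeps its raw text (model size ≈ training data)."""

    def __init__(self) -> None:
        self.corpora: dict[str, bytes] = {}

    def learn_one(self, text: str, label: str) -> None:
        # The whole update step is an append; the concatenation copies the
        # corpus, which is where the ~2x dataset_size memory peak comes from.
        self.corpora[label] = self.corpora.get(label, b"") + b" " + text.encode()

    def predict_one(self, text: str) -> str:
        doc = text.encode()
        def cost(label: str) -> int:
            corpus = self.corpora[label]
            return len(zlib.compress(corpus + doc)) - len(zlib.compress(corpus))
        return min(self.corpora, key=cost)
```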

cyrilou242 commented 11 months ago

Here are the accuracy results for your dataset (called 4News in my codebase)

+--------------------------------------------------+
|                   Method                   |4News|
+--------------------------------------------+-----+
|    FFTC ZSTD_CL9 dataset_prefixed CPC_1    |0.888|
+--------------------------------------------+-----+
|    FFTC ZSTD_CL9 dataset_prefixed CPC_3    |0.903|
+--------------------------------------------+-----+
|    FFTC ZSTD_CL9 dataset_prefixed CPC_5    |0.893|
+--------------------------------------------+-----+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_1|0.892|
+--------------------------------------------+-----+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_3|0.905|
+--------------------------------------------+-----+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_5|0.891|
+--------------------------------------------------+

A legend for the method names:

- dataset_prefixed: the whole training data is kept in memory (what your original method does).
- size_unbounded_optimized: the training data is compressed.
- CPC_n: n compressors per class. Building multiple compressors for each class instead of only one tends to stabilize the inferences (a good number depends on the distribution of classes and the number of training examples; I did not put much thought into this).

I'll add more performance metrics when I have time.
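One plausible reading of CPC_n (a guess at the mechanism from the description above, not necessarily what ftcc does) is to shard each class's corpus across n compressors and average the per-shard scores:

```python
import zlib

def cpc_cost(doc: bytes, class_texts: list[bytes], n: int) -> float:
    # Hypothetical CPC_n: split the class corpus into n shards, score the
    # document against each shard, then average to smooth out noisy matches.
    shards = [b" ".join(class_texts[i::n]) for i in range(n)]
    costs = [
        len(zlib.compress(shard + doc)) - len(zlib.compress(shard))
        for shard in shards
    ]
    return sum(costs) / len(costs)
```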

Here is the train and inference time:

+--------------------------------------------------------------------------+
|                   Method                   |4News_train|4News_predict_p90|
+--------------------------------------------+-----------+-----------------+
|    FFTC ZSTD_CL9 dataset_prefixed CPC_1    |    0.1s   |     1.151ms     |
+--------------------------------------------+-----------+-----------------+
|    FFTC ZSTD_CL9 dataset_prefixed CPC_3    |    0.1s   |     2.723ms     |
+--------------------------------------------+-----------+-----------------+
|    FFTC ZSTD_CL9 dataset_prefixed CPC_5    |    0.1s   |     5.016ms     |
+--------------------------------------------+-----------+-----------------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_1|    4.9s   |      2.16ms     |
+--------------------------------------------+-----------+-----------------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_3|    4.1s   |      2.74ms     |
+--------------------------------------------+-----------+-----------------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_5|    4.2s   |     6.541ms     |
+--------------------------------------------------------------------------+

Indeed a bit faster than your original method 😁

The compression ratio is pretty low (3.98 MB --> 3.44 MB), so it's not really useful in this case. Compression only becomes interesting when the training data is on the order of hundreds of megabytes or gigabytes.

+--------------------------------------------------------+
|                   Method                   |   4News   |
+--------------------------------------------+-----------+
|    FFTC ZSTD_CL9 dataset_prefixed CPC_1    |3.981566 MB|
+--------------------------------------------+-----------+
|    FFTC ZSTD_CL9 dataset_prefixed CPC_3    |3.981558 MB|
+--------------------------------------------+-----------+
|    FFTC ZSTD_CL9 dataset_prefixed CPC_5    | 3.98155 MB|
+--------------------------------------------+-----------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_1|3.444753 MB|
+--------------------------------------------+-----------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_3| 4.75529 MB|
+--------------------------------------------+-----------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_5|4.933784 MB|
+--------------------------------------------------------+

The above is reproducible in a few seconds with python main.py -s -1 -s 0 -d 4News

MaxHalford commented 11 months ago

There are 2 main approaches: without dictionary compression (all the training data in memory, same as you do, model size ≈ training data) and with dictionary compression (zstd --train, model size < training data).

In this case, is it possible to operate on a sliding window? Having the training data in memory is acceptable if it can be limited to a window of size k. See what I mean?

Online update with dictionary compression is in theory possible but I don't know yet if it's straightforward.

Yes, this is what I tried to figure out a while ago when looking at zstd, but I couldn't work it out. It would be really cool though.

And thanks for sharing the results in such detail. It's very promising 🤩
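On the sliding-window question above, a bounded deque per class would keep memory at roughly k recent samples per label. A minimal sketch, with hypothetical names:

```python
from collections import deque
import zlib

WINDOW = 1000  # k: how many recent samples to keep per class
windows: dict[str, deque] = {}

def learn_one(text: str, label: str) -> None:
    # deque(maxlen=k) evicts the oldest sample once the window is full.
    windows.setdefault(label, deque(maxlen=WINDOW)).append(text.encode())

def predict_one(text: str) -> str:
    doc = text.encode()
    def cost(label: str) -> int:
        corpus = b" ".join(windows[label])
        return len(zlib.compress(corpus + doc)) - len(zlib.compress(corpus))
    return min(windows, key=cost)
```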