FTCC: Fast Text Classification with Compressors dictionary
This repository implements a compression-based classification technique that is fast at training AND at inference.

It is a reaction to "Low-Resource" Text Classification: A Parameter-Free Classification Method with Compressors (referred to as the gzip method from now on), which received a lot of attention.
The paper mentions different techniques based on compression, but does not mention the technique implemented in
this repository.
I suspect the FTCC technique in this repository has already been tested in industry, but I could not find an implementation online, so here you go. I think it is a reasonable baseline for compression-based classification.
It is multiple orders of magnitude faster than the gzip method and the other compression-based classifiers mentioned in the paper, and has comparable accuracy (yet to be validated).

Indeed, the idea has been known for ages. It is also likely that a well-built tf-idf model could outperform this method (contributions welcome). The goal of this repo is to provide a correct baseline for compression-based methods that is simple and fast to reproduce. A compression-based method should either have a significantly better accuracy, or a similar accuracy with better compute/memory performance.
Some compressor implementations, such as zstd, can generate a compression dictionary from data.
A compression dictionary is essentially data used to seed the compressor state so it can achieve better compression. The idea is that if you are compressing a lot of similar pieces of data (e.g. JSON documents or anything sharing a similar structure), you can find common patterns across multiple objects and leverage those patterns during compression and decompression to achieve better compression ratios. Dictionary compression is generally only useful for small inputs: data no larger than a few kilobytes.

Intuitively, this means: if I "train" a compressor on a specific topic, it will get better at compressing this particular topic. I can train a different compressor for each topic. Then, given an input text, I can compare all compressors: the topic of the compressor that achieves the best compression ratio is most likely the topic of the input text. This collection of compressors is the text classification model.
Algorithm:

Training: for a dataset of labeled texts `[(text, class)]`, concatenate the texts of each class:

```
{
  class_1: ct_1,
  class_2: ct_2,
  ...
  class_n: ct_n,
}
```

with `ct_i` all texts of label `i`, joined together with the separator `\n`.
Then build one compressor per class, using `ct_i` as its compression dictionary:

```
{
  class_1: class1_compressor,
  class_2: class2_compressor,
  ...
  class_n: classn_compressor,
}
```

This `class_to_compressor` mapping is the model.

Inference: given an input `text`, pick the class whose compressor compresses it best:

```
predicted_class = None
minimum_size = INF
for class, compressor in class_to_compressor:
    size = length(compressor.compress(text))
    if size < minimum_size:
        predicted_class = class
        minimum_size = size
return predicted_class
```
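For concreteness, here is a minimal sketch of this algorithm with the `zstandard` Python package, in the spirit of the `dataset_prefixed` mode described below. It is illustrative only: the function names are made up and the repository's actual code may differ. Each class compressor is seeded with the raw concatenated class text used as a dictionary.

```python
import zstandard

def train(labeled_texts, level=9):
    """labeled_texts: iterable of (text, class) pairs."""
    class_to_text = {}
    for text, cls in labeled_texts:
        class_to_text.setdefault(cls, []).append(text)
    class_to_compressor = {}
    for cls, texts in class_to_text.items():
        # Use the concatenated class text directly as a raw-content dictionary.
        dictionary = zstandard.ZstdCompressionDict(
            "\n".join(texts).encode("utf-8"),
            dict_type=zstandard.DICT_TYPE_RAWCONTENT,
        )
        class_to_compressor[cls] = zstandard.ZstdCompressor(level=level, dict_data=dictionary)
    return class_to_compressor

def predict(class_to_compressor, text):
    encoded = text.encode("utf-8")
    # The class whose compressor produces the smallest output wins.
    return min(class_to_compressor,
               key=lambda cls: len(class_to_compressor[cls].compress(encoded)))
```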
To further improve the algorithm, instead of building a single compressor per class, we build a few of them. For instance, we split a given class text into 3 chunks and train 1 compressor for each chunk. At inference, we take the average of the compression ratios (we could also use a voting approach). This stabilizes inference, and small values between 2 and 5 are enough. Below we call the number of compressors per class the CPC.
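A rough sketch of the CPC variant, under the same assumptions as the snippet above (illustrative helper names, not the repository's API):

```python
import zstandard

def train_cpc(class_to_text, cpc=3, level=9):
    """class_to_text: mapping from class to its concatenated training text."""
    class_to_compressors = {}
    for cls, full_text in class_to_text.items():
        # Split the class text into roughly `cpc` chunks (ceiling division).
        chunk_size = max(1, -(-len(full_text) // cpc))
        chunks = [full_text[i:i + chunk_size] for i in range(0, len(full_text), chunk_size)]
        class_to_compressors[cls] = [
            zstandard.ZstdCompressor(
                level=level,
                dict_data=zstandard.ZstdCompressionDict(
                    chunk.encode("utf-8"), dict_type=zstandard.DICT_TYPE_RAWCONTENT),
            )
            for chunk in chunks
        ]
    return class_to_compressors

def predict_cpc(class_to_compressors, text):
    encoded = text.encode("utf-8")
    # Average the compressed sizes over a class's compressors; smallest average wins.
    return min(
        class_to_compressors,
        key=lambda cls: sum(len(c.compress(encoded)) for c in class_to_compressors[cls])
        / len(class_to_compressors[cls]),
    )
```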
This technique has the following nice properties:

- training and inference are fast
- it is fully deterministic: no `random()` here and there.

`dataset_prefixed` mode

`dataset_prefixed` means the whole dataset is kept in memory by the compressor and used as a prefix. This makes the "training" very fast, because nothing is done.
This means the model has the same size as the training input. This corresponds to the setup of the gzip method.
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Method |AG_NEWS| IMDB|AmazonReviewPolarity|DBpedia|YahooAnswers|YelpReviewPolarity|20News| R8 | R52 |Ohsumed|kinnews|
+------------------------------------+-------+-----+--------------------+-------+------------+------------------+------+-----+-----+-------+-------+
|FFTC ZSTD_CL9 dataset_prefixed CPC_1| 0.863 |0.691| 0.716 | 0.931 | 0.534 | 0.771 | 0.783|0.917|0.846| 0.478 | 0.892 |
+------------------------------------+-------+-----+--------------------+-------+------------+------------------+------+-----+-----+-------+-------+
|FFTC ZSTD_CL9 dataset_prefixed CPC_3| 0.896 |0.773| 0.799 | 0.959 | 0.628 | 0.841 | 0.795|0.935|0.007| 0.424 | 0.883 |
+------------------------------------+-------+-----+--------------------+-------+------------+------------------+------+-----+-----+-------+-------+
|FFTC ZSTD_CL9 dataset_prefixed CPC_5| 0.901 | 0.8 | 0.83 | 0.965 | 0.655 | 0.859 | 0.79 |0.937|0.001| 0.386 | 0.881 |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
Some datasets are not included because they have train/test split issues. See here.
Some other datasets are not included yet because the downloader function seemed broken. Help me fix this! See contribute.
Comparison with A Parameter-Free Classification Method with Compressors (referred to as the gzip method below) on test accuracy:
method | AGNews | DBpedia | YahooAnswers | 20News | kinnews | R8 | R52 | Ohsumed |
---|---|---|---|---|---|---|---|---|
gzip method knn1 | 0.876 | 0.942 | 0.485 | 0.607 | 0.858 | 0.913 | 0.852 | 0.365 |
FFTC ZSTD_CL9 dataset_prefixed CPC_5 (this project) | 0.901 | 0.965 | 0.655 | 0.79 | 0.881 | 0.937 | NA [3] | 0.386 [3] |
FFTC ZSTD_CL9 dataset_prefixed CPC_1 (this project) | | | | | | | 0.846 | 0.478 |
I am not showing the "top 2" scores from the paper because that implementation leaks the labels at prediction time and overestimates accuracy. See issue.
Accuracy results of the gzip method with knn1 are from Ken Schutte's work. See https://kenschutte.com/gzip-knn-paper and https://kenschutte.com/gzip-knn-paper2
[3] Having CPC > 1 is not relevant if there are many categories and few observations for some categories. It would require a more subtle implementation, with the CPC depending on the number of observations of each class.
As shown above, FTCC beats the gzip method on all datasets except R52.
For R52, because there are classes with few examples, learning is difficult and the brute-force approach of gzip gets the advantage.
Still, it is trivial to reach the same level of performance by using a compression level of 12 instead of 9 (`python main.py -d R52 -cpc 1 -c ZSTD_CL12`).
Below is just to give an idea. Run on my 2021 Intel MacBook Pro, with tons of apps running at the same time. Do your own microbenchmark. The computation is multiple orders of magnitude faster than the gzip method.
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Method |AG_NEWS_train|AG_NEWS_predict_p90|IMDB_train|IMDB_predict_p90|AmazonReviewPolarity_train|AmazonReviewPolarity_predict_p90|DBpedia_train|DBpedia_predict_p90|YahooAnswers_train|YahooAnswers_predict_p90|YelpReviewPolarity_train|YelpReviewPolarity_predict_p90|20News_train|20News_predict_p90|R8_train|R8_predict_p90|R52_train|R52_predict_p90|kinnews_train|kinnews_predict_p90|
+------------------------------------+-------------+-------------------+----------+----------------+--------------------------+--------------------------------+-------------+-------------------+------------------+------------------------+------------------------+------------------------------+------------+------------------+--------+--------------+---------+---------------+-------------+-------------------+
|FFTC ZSTD_CL9 dataset_prefixed CPC_1| 2.4s | 0.123ms | 1.3s | 0.368ms | 111.7s | 0.136ms | 15.8s | 0.878ms | 48.1s | 1.363ms | 18.1s | 0.328ms | 0.6s | 6.537ms | 0.0s | 0.768ms | 0.1s | 5.846ms | 0.1s | 3.885ms |
+------------------------------------+-------------+-------------------+----------+----------------+--------------------------+--------------------------------+-------------+-------------------+------------------+------------------------+------------------------+------------------------------+------------+------------------+--------+--------------+---------+---------------+-------------+-------------------+
|FFTC ZSTD_CL9 dataset_prefixed CPC_3| 2.6s | 0.358ms | 1.2s | 1.462ms | 93.9s | 0.451ms | 13.3s | 2.547ms | 52.4s | 3.861ms | 21.1s | 0.939ms | 0.4s | 15.202ms | 0.0s | 2.408ms | 0.1s | 13.566ms | 0.1s | 8.92ms |
+------------------------------------+-------------+-------------------+----------+----------------+--------------------------+--------------------------------+-------------+-------------------+------------------+------------------------+------------------------+------------------------------+------------+------------------+--------+--------------+---------+---------------+-------------+-------------------+
|FFTC ZSTD_CL9 dataset_prefixed CPC_5| 3.0s | 0.696ms | 1.3s | 2.343ms | 109.8s | 0.877ms | 16.6s | 3.867ms | 50.5s | 6.263ms | 21.8s | 1.9ms | 0.4s | 28.908ms | 0.0s | 5.401ms | 0.1s | 24.708ms | 0.1s | 16.786ms |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
As expected, the bigger the CPC, the longer the prediction time, especially when the number of classes is large. Training time is barely impacted.
Reminder: `dataset_prefixed` mode keeps the whole training dataset in memory. In an age where multi-gigabyte NN models are accepted for classification, this can be considered a viable solution. Most models are only a few megabytes. Also, this corresponds to the gzip method setup.
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Method | AG_NEWS | IMDB |AmazonReviewPolarity| DBpedia | YahooAnswers|YelpReviewPolarity| 20News | R8 | R52 | kinnews |
+------------------------------------+------------+------------+--------------------+-------------+-------------+------------------+------------+-----------+-----------+-----------+
|FFTC ZSTD_CL9 dataset_prefixed CPC_1|28.497299 Mb|33.157821 Mb| 1553.96443 Mb |170.588956 Mb|730.885685 Mb| 407.052479 Mb |22.065911 Mb|3.320973 Mb|4.239561 Mb|6.578101 Mb|
+------------------------------------+------------+------------+--------------------+-------------+-------------+------------------+------------+-----------+-----------+-----------+
|FFTC ZSTD_CL9 dataset_prefixed CPC_3|28.497291 Mb|33.157817 Mb| 1553.964426 Mb |170.588928 Mb|730.885665 Mb| 407.052475 Mb |22.065871 Mb|3.320957 Mb|4.239464 Mb|6.578077 Mb|
+------------------------------------+------------+------------+--------------------+-------------+-------------+------------------+------------+-----------+-----------+-----------+
|FFTC ZSTD_CL9 dataset_prefixed CPC_5|28.497283 Mb|33.157813 Mb| 1553.964422 Mb | 170.5889 Mb |730.885645 Mb| 407.052471 Mb |22.065831 Mb|3.320941 Mb|4.239384 Mb|6.578053 Mb|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
`size_unbounded_optimized` mode

`size_unbounded_optimized` means an algorithm automatically optimizes the dictionary, without any bound on the size of the dictionary. See the algorithm here.
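Assuming this optimization relies on zstd's dictionary training (an assumption based on the mode name; the repository's actual procedure may differ), a minimal sketch with the `zstandard` package could look like the following, where `dict_size` is an illustrative cap rather than truly "unbounded":

```python
import zstandard

def build_trained_compressor(class_texts, dict_size=100_000, level=9):
    """class_texts: list of the training texts of one class."""
    samples = [t.encode("utf-8") for t in class_texts]
    # zstd derives a compact dictionary capturing recurring patterns in the samples,
    # instead of keeping the whole concatenated class text as a prefix.
    # (Dictionary training needs a reasonable number of samples to succeed.)
    dictionary = zstandard.train_dictionary(dict_size, samples)
    return zstandard.ZstdCompressor(level=level, dict_data=dictionary)
```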
Performance is a bit worse, while staying reasonable. Training takes longer, but the model is smaller than the whole training set.
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Method |AG_NEWS| IMDB|AmazonReviewPolarity|DBpedia|YahooAnswers|YelpReviewPolarity|20News| R8 | R52 |kinnews|
+--------------------------------------------+-------+-----+--------------------+-------+------------+------------------+------+-----+-----+-------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_1| 0.864 |0.708| 0.704 | 0.92 | 0.528 | 0.756 | 0.773|0.914|0.838| 0.818 |
+--------------------------------------------+-------+-----+--------------------+-------+------------+------------------+------+-----+-----+-------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_3| 0.89 | 0.77| 0.788 | 0.955 | 0.622 | 0.834 | 0.773|0.928|0.002| 0.826 |
+--------------------------------------------+-------+-----+--------------------+-------+------------+------------------+------+-----+-----+-------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_5| 0.896 |0.799| 0.821 | 0.96 | 0.649 | 0.863 | 0.769|0.924|0.001| 0.853 |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
Roughly 1 to 2 percentage points of accuracy are lost for each method compared to `dataset_prefixed` mode.
Below is just to give an idea. Run on my 2021 Intel MacBook Pro. Do your own microbenchmark.
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Method |AG_NEWS_train|AG_NEWS_predict_p90|IMDB_train|IMDB_predict_p90|AmazonReviewPolarity_train|AmazonReviewPolarity_predict_p90|DBpedia_train|DBpedia_predict_p90|YahooAnswers_train|YahooAnswers_predict_p90|YelpReviewPolarity_train|YelpReviewPolarity_predict_p90|20News_train|20News_predict_p90|R8_train|R8_predict_p90|R52_train|R52_predict_p90|kinnews_train|kinnews_predict_p90|
+--------------------------------------------+-------------+-------------------+----------+----------------+--------------------------+--------------------------------+-------------+-------------------+------------------+------------------------+------------------------+------------------------------+------------+------------------+--------+--------------+---------+---------------+-------------+-------------------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_1| 26.3s | 0.179ms | 27.6s | 0.344ms | 891.4s | 0.093ms | 121.6s | 0.489ms | 470.1s | 0.795ms | 216.1s | 0.166ms | 14.8s | 4.034ms | 2.4s | 0.538ms | 3.2s | 4.212ms | 2.4s | 1.945ms |
+--------------------------------------------+-------------+-------------------+----------+----------------+--------------------------+--------------------------------+-------------+-------------------+------------------+------------------------+------------------------+------------------------------+------------+------------------+--------+--------------+---------+---------------+-------------+-------------------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_3| 22.7s | 0.283ms | 22.9s | 0.917ms | 1474.6s | 0.613ms | 287.9s | 3.247ms | 1322.8s | 4.864ms | 871.8s | 1.7ms | 50.2s | 35.201ms | 12.0s | 7.165ms | 14.5s | 34.682ms | 12.3s | 17.665ms |
+--------------------------------------------+-------------+-------------------+----------+----------------+--------------------------+--------------------------------+-------------+-------------------+------------------+------------------------+------------------------+------------------------------+------------+------------------+--------+--------------+---------+---------------+-------------+-------------------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_5| 61.4s | 0.957ms | 58.3s | 3.244ms | 2017.7s | 1.536ms | 337.7s | 5.771ms | 1074.1s | 10.738ms | 752.4s | 2.032ms | 38.6s | 41.63ms | 6.3s | 6.445ms | 8.6s | 33.346ms | 9.8s | 23.899ms |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Training now takes a few minutes because of the dictionary training step.
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Method | AG_NEWS | IMDB |AmazonReviewPolarity| DBpedia | YahooAnswers|YelpReviewPolarity| 20News | R8 | R52 | kinnews |
+--------------------------------------------+------------+------------+--------------------+-------------+-------------+------------------+------------+-----------+-----------+-----------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_1|17.568384 Mb|11.067007 Mb| 12.018115 Mb | 73.808126 Mb| 56.404952 Mb| 13.149491 Mb |16.098017 Mb| 2.50625 Mb|3.441763 Mb|1.445467 Mb|
+--------------------------------------------+------------+------------+--------------------+-------------+-------------+------------------+------------+-----------+-----------+-----------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_3|25.496669 Mb|23.231406 Mb| 35.56142 Mb |125.385563 Mb|163.297462 Mb| 38.984491 Mb |22.323596 Mb|3.128363 Mb|3.990758 Mb|2.130884 Mb|
+--------------------------------------------+------------+------------+--------------------+-------------+-------------+------------------+------------+-----------+-----------+-----------+
|FFTC ZSTD_CL9 size_unbounded_optimized CPC_5|27.610126 Mb|29.415687 Mb| 59.45308 Mb |144.977827 Mb|265.284513 Mb| 66.247472 Mb |25.642043 Mb|3.249029 Mb| 4.16616 Mb|2.742979 Mb|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Consider the dataset with the biggest training set, AmazonReviewPolarity:
Method | Accuracy | Size | Training Time | Inference speed p90 |
---|---|---|---|---|
FFTC ZSTD_CL9 dataset_prefixed CPC_5 | 0.83 | 1553.96 Mb | 109.8s [1] | 0.877ms |
FFTC ZSTD_CL9 size_unbounded_optimized CPC_5 | 0.821 | 59.45 Mb | 2017.7s | 1.536ms |
[1] The time is actually spent joining the training examples together; Python is really slow at this. This could easily be optimized.
Main result:
We built a solution that shows the learning/generalization potential of compression-dictionary-based methods.
We compress a 1.5 GB model into a 60 MB model, a ~26x compression ratio, while only losing 1 percentage point of accuracy.
Inference time is barely impacted by this compression; it stays in the millisecond range.
Results are similar for YahooAnswers and Yelp, with smaller compression ratios.
This suggests that compression dictionary training scales well and should be explored on bigger datasets.
For smaller datasets, keeping the whole dataset as a dictionary prefix is simple and results in extremely fast training and inference.
In the benchmarks above we used a compression level of 9. The compression level can go up to 22. I have observed that compression levels up to 12 almost always yield an accuracy improvement, at the cost of slightly slower training and inference. Above 12, compression takes more time and I could not observe a consistent accuracy improvement.
Requirements
python 3.7
Install
pip install -r requirements.txt
Run all datasets (`dataset_prefixed` mode). Recommended to compare performance with the gzip method, or with setups that can afford to keep the whole training dataset in memory (e.g. small datasets).
python main.py -s -1
This will train and evaluate 30 models, so it takes some time: around 40 minutes on commodity hardware. Most of that time is spent on the AmazonReviewPolarity dataset, which has 400,000 test examples. See below for how to select the datasets, compressors, CPC and size constraint.
Run all datasets (`size_unbounded_optimized` mode). Recommended to compare performance with setups that can't afford to keep the whole training data in memory (big datasets, memory constraints), or if you are interested in the "learning" side of compression.
python main.py -s 0
This will train and evaluate 30 models. Training takes some time in this mode because dictionaries are trained; expect around 4 hours on commodity hardware. See below for how to select the datasets, compressors, CPC and size constraint.
Run on specific datasets
python main.py -d AG_NEWS -d IMDB
Run with specific compressors
python main.py -d AG_NEWS -c ZSTD_CL12
Run with specific CPC
python main.py -d AG_NEWS -cpc 1 -cpc 3
Run with specific memory constraints
python main.py -s -1 -s 0 -s 10000
- `-1` means the full dataset is used as a prefix by the dictionary (`dataset_prefixed` mode)
- `0` means the dictionary size is unbounded but optimized automatically (`size_unbounded_optimized` mode)
- `10000` means each generated dictionary size is bounded to 10000 bytes.
See Accuracy performance for more info.
You can combine all the parameters together. For instance:
python main.py -d AG_NEWS -d IMDB -cpc 1 -cpc 3 -c ZSTD_CL9 -c ZSTD_CL12 -s -1 -s 0
To get the full help and see the possible values for each parameter, run:
python main.py --help
The results are written to `accuracy_results.csv`, `speed_results.csv`, and `size_results.csv`, in addition to being printed to the console.
For Google Drive links, you may have to comment out lines 22 to 31 in torchtext's `_download_hooks.py`.
If you have the following error on macOS:
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)>
You may have to run something like
ln -s /etc/ssl/* /Library/Frameworks/Python.framework/Versions/3.7/etc/openssl
If you use FTCC in your publication, please cite it by using the following BibTeX entry.
@Misc{ftcc,
title = {FTCC: Fast Text Classification with Compressors dictionary},
author = {Cyril de Catheu},
howpublished = {\url{https://github.com/cyrilou242/ftcc}},
year = {2023}
}