jamestiotio opened this issue 1 year ago
Hey @jamestiotio, you may want to look at this: https://github.com/cyrilou242/ftcc
It's an order-of-magnitude faster compression-based method with similar accuracy, based on zstd.
This looks interesting. It seems similar to DEFLATE, which appears to preserve the semantically important contiguities relevant for text classification better than other algorithms (the dictionary coding is definitely important here), even though those other algorithms might produce better compression ratios. There's a bit of a No-Free-Lunch situation at play here. It would be nice to somehow bound the potential accuracy gains we could get by optimizing the compression algorithm for the specific problem of text classification; then we could balance that against pragmatic concerns (speed, etc.). Are DEFLATE-style algorithms brushing up against that limit?
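To make that intuition concrete, here is a toy illustration (my own made-up strings, not anything from the paper) of how DEFLATE's back-references reward shared substrings: appending a topically similar sentence costs far fewer compressed bytes than appending an unrelated one.

```python
import gzip

def clen(s: str) -> int:
    """gzip-compressed length of the UTF-8 bytes of s."""
    return len(gzip.compress(s.encode("utf-8")))

a = "the stock market rallied as tech shares surged"
b = "tech shares surged again and the stock market rallied"
c = "the recipe calls for two cups of flour and one egg"

# DEFLATE back-references make shared substrings nearly free the second
# time around, so the increment should be noticeably smaller for the
# related pair than for the unrelated pair.
print(clen(a + " " + b) - clen(a))  # related pair: small increment
print(clen(a + " " + c) - clen(a))  # unrelated pair: larger increment
```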
Hi there!
First of all, I appreciate the team for putting in the work for this research paper. I would like to preface this by saying that my comments here should just be considered a point of discussion for possible future research related to this topic, either by the team or by the rest of the community. I am not implying that any significant changes to the current version of the paper are necessary. I hope that this is crystal clear. After all, I am not a formal researcher or reviewer of this paper. 😄
With that said, here are some of my comments on how future research can be done to improve the work done here:
Let's begin by stating that the overall idea of performing NLP using data compression is not new:
We can also find a plethora of related research for other domains:
There was also some work done in 2021 on text classification by data compression inspired by the AIMA textbook here, albeit not using KNN.
The paper's authors claim that they are "the first to use NCD with kNN for topic classification". Comparing this claim with the aforementioned list of research work, the actual novel component is the following combination of concepts:
The combination above hinges on the fact that "text" is the data type of interest here, which is rather weak since "text" is a subset of binary data (i.e., text can be converted into bytes). The only significant difference between text and other types of binary data, such as images or executable files, would be its structural semantics.
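For concreteness, the combination in question is roughly the following (a minimal sketch using Python's `gzip`; the concatenation separator, tie-breaking, and `(text, label)` data layout are my own assumptions and may differ from the official implementation):

```python
import gzip
from collections import Counter

def C(s: str) -> int:
    # compressed length as a stand-in for Kolmogorov complexity
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = C(x), C(y)
    return (C(x + " " + y) - min(cx, cy)) / max(cx, cy)

def knn_predict(test_text: str, train_set: list[tuple[str, str]], k: int = 3) -> str:
    # train_set is a hypothetical list of (text, label) pairs
    neighbors = sorted(train_set, key=lambda pair: ncd(test_text, pair[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```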
The paper's authors highlight which distance metric (NCD), which compression algorithm (gzip), and which classification method (kNN) are used. Instead of simply stating which selections were made, perhaps exploring the following questions in the future can provide greater insights:

- Why were these specific selections made? Why not evaluate other compression algorithms (LZW, zip, etc., on top of the ones already in the paper), compression schemes (Huffman coding, etc.), and classification methods (SVM, K-Means, K-Medoids, etc.) for this specific application of text classification?
- The paper mentioned that "the compressor is data-type-agnostic by nature and non-parametric methods do not introduce inductive bias during training". While this might be true for the specific method used by the paper, it might not be true for all compressors in general. For example, `zstd` builds an internal compression dictionary when it encounters the training data, which can potentially be reused to perform predictions on the test data. On the other hand, since `gzip` is based on the DEFLATE algorithm, it simply uses the previous 32K bytes as a source for potential matches. Meanwhile, `zlib`, another DEFLATE-based compressor, even allows the user to "prime" the compressor with 32K bytes. Does `gzip` really not introduce any bias? (See the sketch after this list.)
- The main advantage of the approach stated by the paper is that it requires far fewer resources to perform as well as, or even better than, more resource-intensive LLMs such as BERT.
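To illustrate the bias question above, here is a rough sketch (not the paper's code; `train_set` is a hypothetical list of `(text, label)` pairs, and `zstandard` is a third-party binding) of how a compressor can absorb information from the training split before it ever sees a test example:

```python
import zlib
import zstandard  # third-party binding: pip install zstandard

samples = [text.encode("utf-8") for text, _ in train_set]

# zstd: a dictionary trained on the training split is reused when
# compressing test texts, so the compressor is no longer independent
# of the training data.
zdict = zstandard.train_dictionary(16 * 1024, samples)
cctx = zstandard.ZstdCompressor(dict_data=zdict)

def zstd_len(text: str) -> int:
    return len(cctx.compress(text.encode("utf-8")))

# zlib: the DEFLATE window can be "primed" with up to 32K bytes of
# context via zdict, which plays a similar role.
primer = b" ".join(samples)[-32 * 1024:]

def primed_len(text: str) -> int:
    c = zlib.compressobj(zdict=primer)
    return len(c.compress(text.encode("utf-8")) + c.flush())
```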
I hope that the paper's authors as well as other members of the research community can consider my comments and research questions as possible starting points for future research. Cheers and all the best!
Best regards, James Raphael Tiovalen