bazingagin / npc_gzip

Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

Discussion on future work #48

Open jamestiotio opened 1 year ago

jamestiotio commented 1 year ago

Hi there!

First of all, thank you to the team for the work put into this research paper. I would like to preface this by saying that my comments here should just be considered points of discussion for possible future research related to this topic, either by the team or by the rest of the community. I am not implying that any significant changes to the current version of the paper are necessary; I hope that this is crystal clear. After all, I am not a formal researcher or reviewer of this paper. 😄

With that said, here are some of my comments on how future research could build on the work done here:

  1. Let's begin by stating that the overall idea of performing NLP using data compression is not new.

    We can also find a plethora of related research in other domains.

    There was also some work done in 2021 on text classification by data compression inspired by the AIMA textbook here, albeit not using KNN.

    The paper's authors claim that they are "the first to use NCD with kNN for topic classification". Comparing this claim against the aforementioned research, the actual novel component is the following combination of concepts:

    • NCD as the distance metric used
    • KNN as the classification method used
    • Gzip as the compressor used
    • Topic classification as the objective
    • Text as the data type of interest

    The combination above hinges on the fact that "text" is the data type of interest here, which is a rather weak distinction, since "text" is a subset of binary data (i.e., text can be converted into bytes). The only significant difference between text and other types of binary data, such as images or executable files, would be its structural semantics. (For concreteness, a minimal sketch of the NCD-Gzip-KNN pipeline itself is included after this list.)

    • Exploring the usage of other distance metrics proposed by previous work, such as LZJD or SHWeL, could be a more meaningful contribution than largely repeating previous work with new datasets and data types. After all, the paper mentioned that "the compressor is data-type-agnostic by nature". Redoing a data-type-agnostic approach for different data types seems somewhat redundant (although I suppose that verifying the claim that it is indeed data-type-agnostic could be worthwhile as well).
    • Exploring modifications to NCD can also provide more meaningful insights. For example, the paper "On normalized compression distance and large malware" (2016) suggested "interleaving" and "NCD-shuffle" as possible adaptations to NCD to improve its performance for large input data. Applying these techniques to text classification might yield interesting results.
  2. The paper's authors highlight which distance metric (NCD), which compression algorithm (Gzip), and which classification method (KNN) are used. Instead of simply stating which choices were made, perhaps exploring the following questions in the future can provide greater insights:

    • Why were NCD, Gzip, and KNN chosen? What advantages and disadvantages do they have over other distance metrics (LZJD, SHWeL, etc.), compressors (LZW12, zip, etc., on top of the ones already in the paper), compression schemes (Huffman coding, etc.), and classification methods (SVM, K-Means, K-Medoids, etc.) for this specific application of text classification?
    • A comparison of both the speed and performance of the "NCD-Gzip-KNN" method against other combinations of distance metrics, compressors, and classifiers might also be insightful.
  3. The paper mentioned that "the compressor is data-type-agnostic by nature and non-parametric methods do not introduce inductive bias during training". While this might be true for the specific method used by the paper, it might not hold for compressors in general. For example, zstd can build an internal compression dictionary from the training data, which can then be reused to perform predictions on the test data. Gzip, on the other hand, is based on the DEFLATE algorithm and simply uses the previous 32K bytes as a source for potential matches, while zlib, another DEFLATE-based compressor, even allows the user to "prime" the compressor with up to 32K bytes of dictionary data (see the short zlib sketch after this list).

    • Do the 32K bytes used by gzip really not introduce any bias?
    • Are compressors really "parameter-free"? Do the internal states of the compressor count as parameters, especially since that state directly helps the compressor to compress more effectively and efficiently?
    • Would application-specific or domain-specific compressors help with classification? For example, instead of using general LZ77-based compression algorithms, compressing DNA sequences is usually done by using a custom algorithm (see this and this).
  4. The main advantage of the approach, as stated by the paper, is that it requires far fewer resources while performing as well as, or even better than, more resource-intensive LLMs such as BERT.

    • More comparisons with other LLMs can further highlight the competitiveness of the compression approach.
    • Future work can also consider comparisons between this compression approach and other similar lower-cost alternatives, such as bag-of-words, Multinomial Naive Bayes, word embeddings, etc. (on top of the ones already in the paper); a tiny bag-of-words baseline is sketched below for reference.
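
To make some of these points more concrete, here is a minimal sketch of how I understand the NCD-Gzip-KNN pipeline from points 1 and 2. The helper names (`clen`, `ncd`, `classify`) and the `train_set`/`k` inputs are mine, chosen purely for illustration; this is a rough reference implementation, not the authors' exact code:

```python
# Rough sketch of an NCD + gzip + kNN text classifier.
import gzip
from collections import Counter

def clen(data: bytes) -> int:
    """Length of the gzip-compressed representation of `data`."""
    return len(gzip.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, approximated with gzip.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where xy is the concatenation of x and y (joined with a space here).
    """
    cx, cy, cxy = clen(x), clen(y), clen(x + b" " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(text: str, train_set: list[tuple[str, str]], k: int = 3) -> str:
    """Label `text` by majority vote among its k nearest training samples."""
    x = text.encode("utf-8")
    neighbours = sorted((ncd(x, t.encode("utf-8")), label) for t, label in train_set)
    return Counter(label for _, label in neighbours[:k]).most_common(1)[0][0]
```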
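
On point 3, the "priming" question can be seen directly in code: zlib's `zdict` argument pre-loads the DEFLATE window with reference data before compression starts. The byte strings below are made up for illustration; the point is simply that shared priming data looks a lot like a learned parameter, since a decompressor would need the same `zdict` to recover the stream:

```python
# Toy illustration of "priming" a DEFLATE stream via zlib's zdict argument.
import zlib

priming_text = b"team season game player coach score win league match"  # made-up domain text
sample = b"the team won the match after the coach benched a player"

# Compress without any priming.
plain = zlib.compressobj()
plain_size = len(plain.compress(sample) + plain.flush())

# Compress with the DEFLATE window pre-loaded ("primed") with related text.
primed = zlib.compressobj(zdict=priming_text)
primed_size = len(primed.compress(sample) + primed.flush())

print(plain_size, primed_size)  # the primed stream is typically a bit smaller
```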
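
And on point 4, a "lower-cost alternative" baseline can be as small as the following scikit-learn sketch (the toy data is made up for illustration):

```python
# Tiny bag-of-words + Multinomial Naive Bayes baseline using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["the team won the match", "the senate passed the bill"]  # toy data
train_labels = ["sports", "politics"]

baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(train_texts, train_labels)
print(baseline.predict(["the coach praised the team"]))
```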

I hope that the paper's authors as well as other members of the research community can consider my comments and research questions as possible starting points for future research. Cheers and all the best!

Best regards, James Raphael Tiovalen

cyrilou242 commented 1 year ago

Hey @jamestiotio, you may want to take a look at https://github.com/cyrilou242/ftcc. It's a compression-based method built on zstd that is orders of magnitude faster, with similar accuracy.

CaltropHungerton commented 1 year ago

This looks interesting. It seems similar to DEFLATE, which appears to be better than other algorithms at preserving the semantically important contiguities relevant for text classification (the dictionary coding is definitely important here), even though those other algorithms might achieve better compression ratios. There's a bit of a No-Free-Lunch situation at play here. It would be nice to somehow bound the potential gains in accuracy we could get by optimizing the compression algorithm for the specific problem of text classification. Then we could balance that against pragmatic concerns (speed, etc.). Are DEFLATE-style algorithms already brushing up against that limit?
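
If I understand the zstd-dictionary idea correctly, the general shape would be something like the sketch below: train one dictionary per class, then label a document by whichever class dictionary compresses it to the fewest bytes. This is only my guess at the shape of such an approach (not the actual ftcc implementation), and it uses the third-party `zstandard` package; `train_texts` here is a hypothetical {label: [documents]} mapping:

```python
# Hedged sketch of a zstd dictionary-per-class classifier.
# A guess at the general shape of such an approach, not ftcc's actual code.
import zstandard as zstd

def build_class_compressors(train_texts, dict_size=16_384):
    """Train one zstd dictionary per class and wrap each in a compressor.

    Note: zstd dictionary training generally wants many sample documents
    per class; `dict_size` is a placeholder value.
    """
    compressors = {}
    for label, texts in train_texts.items():
        samples = [t.encode("utf-8") for t in texts]
        dict_data = zstd.train_dictionary(dict_size, samples)
        compressors[label] = zstd.ZstdCompressor(dict_data=dict_data)
    return compressors

def classify(text, compressors):
    """Pick the class whose dictionary compresses the text to the fewest bytes."""
    data = text.encode("utf-8")
    return min(compressors, key=lambda label: len(compressors[label].compress(data)))
```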