cyrilou242 / ftcc

Fast Text Classification with Compressors dictionary
MIT License

Using short text for text classification #3

Closed by icang1694 1 year ago

icang1694 commented 1 year ago

I have been exploring the possibility of using ftcc for text classification on short texts, but so far I get poor accuracy. Any advice on applying it to short-text classification? Thank you.

CaltropHungerton commented 1 year ago

It's difficult to get compression gains on short strings. No way around it. A short string is less likely to contain redundancies or patterns for the compressor to exploit. For some REALLY short strings, compressing them will make them longer! (because of the header and metadata that the compressed format adds)
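A quick way to see this is the minimal sketch below. It uses Python's standard `zlib` module purely as an illustration (an assumption on my part, not necessarily the compressor ftcc uses), and the two sample strings are made up.

```python
import zlib

# Hypothetical sample inputs: a very short string and a long, repetitive one.
short = b"ok"
long_text = b"the quick brown fox jumps over the lazy dog. " * 20

for label, data in (("short", short), ("long", long_text)):
    compressed = zlib.compress(data, 9)
    # The zlib format adds a header and a checksum, which dominate tiny inputs.
    print(f"{label}: {len(data)} bytes -> {len(compressed)} bytes")
```

The short input comes out larger than it went in, while the repetitive long input shrinks considerably.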

flipz357 commented 1 year ago

@CaltropHungerton For the way the gzip paper does it, I agree with what you're saying. But the way ftcc does it should be more robust to short documents. The label-wise compressor is "learned" by concatenating all texts of a label together, so when you compress a new text with each label's compressor, the ftcc prediction should still be reasonably meaningful even if the text is short.
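The label-wise idea can be sketched roughly as follows. This is not ftcc's actual code: it uses `zlib`'s preset dictionary (`zdict`) from the standard library as a stand-in for a per-label compressor, and the helper names (`build_dictionaries`, `compressed_size`, `predict`) and the toy training data are hypothetical.

```python
import zlib

def build_dictionaries(train):
    # "Learn" one dictionary per label by concatenating all texts of that label.
    return {label: " ".join(texts).encode("utf-8") for label, texts in train.items()}

def compressed_size(text, dictionary):
    # Compress the text with the label's concatenated texts as a preset dictionary.
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return len(comp.compress(text.encode("utf-8")) + comp.flush())

def predict(text, dictionaries):
    # Predict the label whose dictionary yields the smallest compressed output.
    return min(dictionaries, key=lambda label: compressed_size(text, dictionaries[label]))

# Toy training data, made up for illustration only.
train = {
    "sports": ["the team won the match after scoring a late goal in the final minutes"],
    "tech": ["the new laptop ships with a faster processor and more memory"],
}
dictionaries = build_dictionaries(train)
print(predict("scoring a late goal", dictionaries))
```

Because the patterns come from the label's dictionary rather than from the new text itself, even a short text can compress well against the right label, at least when it shares substrings with the training texts.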

cyrilou242 commented 1 year ago

@icang1694 @CaltropHungerton By design, ftcc will not perform well on short text. Even if the "learning" is fine, each compression can be seen as a random draw from a normal distribution whose variance grows as the text gets shorter. So with very short text, it is quite likely that one of the other compressors will achieve better compression than the compressor of the expected class.
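To make the variance argument concrete, here is a small simulation sketch. Everything in it is an assumption chosen for illustration, not something measured from ftcc: the compression ratio of each class is modelled as a normal draw, the true class has a slightly lower mean (`gap`), and the noise standard deviation is assumed to shrink as `1 / sqrt(n_tokens)`.

```python
import random

def error_rate(n_tokens, n_classes=4, gap=0.05, trials=5000, seed=0):
    """Estimate how often a wrong class 'wins' under the assumed noise model."""
    rng = random.Random(seed)
    sigma = 1.0 / n_tokens ** 0.5  # assumed: noise shrinks with text length
    errors = 0
    for _ in range(trials):
        # Class 0 is the true class: its mean compression ratio is lower by `gap`.
        scores = [
            rng.gauss(0.5 - gap if c == 0 else 0.5, sigma)
            for c in range(n_classes)
        ]
        best = min(range(n_classes), key=scores.__getitem__)
        if best != 0:
            errors += 1
    return errors / trials

for n in (5, 50, 500, 5000):
    print(n, error_rate(n))
```

Under these assumptions, the estimated error rate is high for very short texts and drops as the text gets longer, which matches the intuition that a small real difference between classes gets drowned out by noise on short inputs.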

As mentioned by @CaltropHungerton, there is also this:

"For some REALLY short strings, compressing them will make them longer"

By the way, I need to check whether the metadata overhead should be corrected for in my implementation.

@icang1694 ftcc is not a good choice for short sentence classification. I'd suggest fine-tuning a language model on your classification task. In short sentences, stopwords, punctuation, grammatical structure and subtle semantics are important. Language models capture those much better than traditional approaches like tf-idf (and ftcc).