Text models, uint8 for number of classes?

mitchellh commented 8 years ago

I don't know that much at the moment about ML so pardon me if this is ignorant. Is there a reason that the number of classes for text classification is limited to 255 via uint8? Would it be possible to increase this?

cdipaolo commented 8 years ago

It's definitely possible to increase it, but whether that would be a good idea in practice is questionable (though not necessarily a bad idea).
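For reference, here is a minimal sketch of what the uint8 limit means for label values. The `classID` type and `fitsInClassID` helper are hypothetical names made up for illustration, not taken from the library.

```go
package main

import (
	"fmt"
	"math"
)

// classID mirrors the idea of storing a class label in a single byte.
// The name is hypothetical and not part of the library.
type classID uint8

// fitsInClassID reports whether n can be stored in a classID without wrapping.
func fitsInClassID(n int) bool {
	return n >= 0 && n <= math.MaxUint8
}

func main() {
	fmt.Println(math.MaxUint8)      // largest label value: 255
	fmt.Println(fitsInClassID(255)) // true
	fmt.Println(fitsInClassID(256)) // false

	n := 256
	fmt.Println(classID(n)) // the conversion wraps around to 0
}
```

A wider type such as uint16 would lift the ceiling, which is why the change itself is easy; the open question is whether the model can make use of that many classes.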

A bag-of-words Naive Bayes classifier (exactly what the library implements), for example, uses only the word frequencies in a document to choose a class. This means that to get high accuracy, precision, and recall on a text data set with even 256 classes, the words alone would have to indicate accurately which of the 256 classes each data point belongs to.
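To make the "only word frequencies" point concrete, here is a small sketch showing that two sentences with opposite meanings can produce exactly the same bag of words. The `bagOfWords` and `sameBag` helpers are written for this illustration and are not the library's tokenizer; splitting on whitespace and lowercasing are simplifying assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// bagOfWords lowercases a document, splits it on whitespace, and counts
// how often each word occurs. Word order is discarded entirely.
func bagOfWords(doc string) map[string]int {
	counts := make(map[string]int)
	for _, w := range strings.Fields(strings.ToLower(doc)) {
		counts[w]++
	}
	return counts
}

// sameBag reports whether two bags contain identical word counts.
func sameBag(a, b map[string]int) bool {
	if len(a) != len(b) {
		return false
	}
	for w, n := range a {
		if b[w] != n {
			return false
		}
	}
	return true
}

func main() {
	a := bagOfWords("the food was good not bad")
	b := bagOfWords("the food was bad not good")
	fmt.Println(sameBag(a, b)) // true: a position-agnostic model sees no difference
}
```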

This might be reasonable for classes that are essentially asking about the words themselves, such as includes_words_about_sky and includes_words_about_color_blue, but when you're trying to classify more abstract concepts like sentiment_is_positive, sentiment_is_slightly_less_positive, sentiment_is_a_little_more_less_positive, you can see how this might get tricky.

As a practical example of a Naive Bayes classifier being used with more classes, look at one of the 2014 winners of the Yelp Dataset Challenge, who classified the star ratings of restaurant reviews. Looking at the confusion matrix of their best model, this gets cloudy even for five classes. It's easy to differentiate between 5-star and 1-star reviews, but differentiating between 5-star and 4-star reviews using all length-2 sequences of words (bigrams: "I love dogs" -> "I love", "love dogs"; compare unigrams: "I love dogs" -> "I", "love", "dogs") is pretty hard without more latent information such as word position. I ran an experiment very similar to this team's on a more recent version of the data set using SciPy and got similar results.
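The unigram/bigram split can be sketched in a few lines. The `ngrams` function below is a hypothetical helper that splits on whitespace only; it is not the feature extraction the Yelp team used.

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams returns every run of n consecutive whitespace-separated words in
// text, joined by single spaces. It returns nil when n is not positive or
// the text has fewer than n words.
func ngrams(text string, n int) []string {
	words := strings.Fields(text)
	if n <= 0 || len(words) < n {
		return nil
	}
	grams := make([]string, 0, len(words)-n+1)
	for i := 0; i+n <= len(words); i++ {
		grams = append(grams, strings.Join(words[i:i+n], " "))
	}
	return grams
}

func main() {
	fmt.Printf("%q\n", ngrams("I love dogs", 1)) // ["I" "love" "dogs"]
	fmt.Printf("%q\n", ngrams("I love dogs", 2)) // ["I love" "love dogs"]
}
```

Bigrams keep a little local word order, but anything that depends on position further apart in the review is still lost.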

To get back to your question: do you have a use case with text data in more than 255 classes that could plausibly get not-too-bad accuracy from a position-agnostic model? I'm not saying it isn't possible; I'm just curious whether such a use case exists.

mitchellh commented 8 years ago

Yep, you're right. I was able to break it down into multiple Bayes models and it worked great. :)
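The thread doesn't say how the models were split, so the following is only one possible arrangement: a coarse model picks a group, then a per-group model picks the class within it. Everything here is hypothetical (the `Classifier` interface, `TwoStage`, and the toy `keywordModel` stand in for real trained models); the only assumption about the real models is that each returns a uint8 label.

```go
package main

import (
	"fmt"
	"strings"
)

// Classifier is a hypothetical stand-in for any model whose labels fit
// in a uint8.
type Classifier interface {
	Predict(text string) uint8
}

// keywordModel is a toy classifier that exists only to make this sketch
// runnable: it returns the label of the first known word in the text,
// or 0 if no word is known.
type keywordModel map[string]uint8

func (m keywordModel) Predict(text string) uint8 {
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if label, ok := m[w]; ok {
			return label
		}
	}
	return 0
}

// TwoStage routes text to a group with a coarse model, then lets that
// group's own model choose the class inside the group.
type TwoStage struct {
	Coarse Classifier
	Fine   map[uint8]Classifier
}

// Predict returns the group and the class within that group. The class
// is 0 when no fine model is registered for the predicted group.
func (t TwoStage) Predict(text string) (group, class uint8) {
	group = t.Coarse.Predict(text)
	if fine, ok := t.Fine[group]; ok {
		class = fine.Predict(text)
	}
	return group, class
}

func main() {
	model := TwoStage{
		Coarse: keywordModel{"goal": 1, "recipe": 2},
		Fine: map[uint8]Classifier{
			1: keywordModel{"football": 1, "hockey": 2},
			2: keywordModel{"pasta": 1, "curry": 2},
		},
	}
	group, class := model.Predict("a late goal decided the hockey final")
	fmt.Println(group, class) // 1 2
}
```

Each individual model stays within the uint8 limit, while the (group, class) pair can address many more combined labels.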

cdipaolo / goml