explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Why did spaCy choose a deep NN for its classification model? #1550

Closed trivedigaurav closed 7 years ago

trivedigaurav commented 7 years ago

spaCy 2.0 has a text classifier using deep neural net models built into its pipeline. I was wondering if it would be useful to add linear models or an SVM as well. These models seem to perform just as well, and take a fraction of the training time...

Here are my notebooks comparing them:

  1. LinearSVC using scikit https://gist.github.com/trivedigaurav/41566f053055ae373c8c09a33be9848c
  2. Spacy's sample text classification script: https://gist.github.com/trivedigaurav/a0bd35fd25e0e50ae2da8e845508d8ee
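
For context, the kind of linear baseline being compared might look like the following. This is a minimal sketch using scikit-learn's `TfidfVectorizer` and `LinearSVC` on toy data, not the actual pipeline from the notebooks above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for a real labelled corpus.
texts = ["loved it", "great movie", "terrible plot",
         "awful acting", "really enjoyed this", "boring and bad"]
labels = [1, 1, 0, 0, 1, 0]

# Unigram TF-IDF features feeding a linear SVM: fast to train,
# and often a strong baseline for text classification.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LinearSVC())
model.fit(texts, labels)

print(model.predict(["enjoyed the movie", "terrible and boring"]))
```

Training such a pipeline typically takes seconds even on large corpora, which is the speed advantage being weighed against the neural model's accuracy.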

Manslow commented 7 years ago

https://spacy.io/usage/v2#features-models states that:

spaCy v2.0 features new neural models for tagging, parsing and entity recognition. The models have been designed and implemented from scratch specifically for spaCy, to give you an unmatched balance of speed, size and accuracy. The new models are 10× smaller, 20% more accurate, and even cheaper to run than the previous generation.

spaCy v2.0's new neural network models bring significant improvements in accuracy, especially for English Named Entity Recognition. The new en_core_web_lg model makes about 25% fewer mistakes than the corresponding v1.x model and is within 1% of the current state-of-the-art (Strubell et al., 2017). The v2.0 models are also cheaper to run at scale, as they require under 1 GB of memory per process.

I'd assume their validation methods are fairly thorough, and that the practical aspects of speed and size, which matter for wider adoption, have also been addressed.

trivedigaurav commented 7 years ago

Thanks @Manslow. I'm not sure that passage claims anything about text classification models, though...

honnibal commented 7 years ago

@trivedigaurav If I'm reading your results correctly, the scores from spaCy's CNN are indeed better on this task, right? spaCy hits a 0.9 F-score after the second epoch, while your linear model scores 0.84.
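
For reference, the F-score being quoted is the harmonic mean of precision and recall; a quick sketch with toy numbers (not the notebooks' actual counts):

```python
def f1(precision, recall):
    # Harmonic mean: punishes imbalance between precision and recall
    # more than the arithmetic mean would.
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))            # balanced precision/recall
print(round(f1(0.95, 0.75), 3))  # high precision, weaker recall drags F1 down
```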

The textcat model actually stacks a unigram bag-of-words model and the CNN model. So yes, you can use only the linear model, and it's very fast. For linear-model text classification, the functionality in both scikit-learn and Vowpal Wabbit is very good. I have nothing to add to this.
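
As a rough illustration of why the unigram bag-of-words component is so cheap, here's a minimal pure-Python sketch of unigram feature extraction and linear scoring (hypothetical helper names and toy weights, not spaCy's implementation):

```python
from collections import Counter

def unigram_bow(text):
    # One pass over the tokens: feature extraction is O(n) in text length.
    return Counter(text.lower().split())

def score(bow, weights, bias=0.0):
    # A linear model is just a dot product of feature counts
    # with learned per-word weights.
    return bias + sum(count * weights.get(word, 0.0)
                      for word, count in bow.items())

weights = {"great": 1.5, "boring": -2.0}  # toy weights, not learned
print(score(unigram_bow("A great but slightly boring film"), weights))
```

There's no convolution or embedding lookup here, which is why the linear path trains and runs in a fraction of the CNN's time.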

But deep learning can perform better for text classification on some tasks, especially on short texts with little training data. And the deep learning text classifiers available elsewhere are really bad! That's why I've included the model in spaCy.

The Keras example scripts for IMDB text classification get to like 82%, using all 25000 examples. The recipe in spaCy gets to 90% using just 1600 examples. The Keras recipe does so poorly (much worse than unigram bag of words) because they truncate the training data. That's pretty dumb for reviews --- people say whether they liked the thing in the last sentence a lot of the time.
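
The truncation point can be illustrated with a toy sketch (pure Python, not the Keras preprocessing code): capping a review at its first few tokens can drop exactly the sentence that carries the verdict.

```python
review = ("The plot meanders and the pacing drags in the middle. "
          "Still, the final act won me over: I loved it.")

tokens = review.split()
maxlen = 12

# Keep only the first `maxlen` tokens, as a length cap applied
# from the front of the document would.
truncated = " ".join(tokens[:maxlen])
print(truncated)  # the verdict in the last sentence is gone
```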

trivedigaurav commented 7 years ago

Yes, that's correct. The CNN results are better, especially on recall, but the model takes considerably longer to train.

I understand your point, but I wasn't sure whether CNNs are known to outperform other methods on text classification yet. Otherwise, I have usually found spaCy's choices to be consistent with a policy of picking the best state-of-the-art methods available.

honnibal commented 7 years ago

You do raise a good point. We're still working on full tutorials for the textcat, so I'll think about how to explain this.

For your reference, this is the Yang et al (2016) paper: http://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf

jmrosen155 commented 6 years ago

@honnibal There doesn't yet seem to be much documentation on the specifics of the unigram bag-of-words + CNN algorithm behind the textcat model in spaCy. Is the model based on one of the models from the Yang et al. paper mentioned above? Or is there another resource you can share that discusses the details of the model? Thanks.

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.