explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Casing substantially impacts parsing... #405

Closed: brandoncarl closed this issue 8 years ago

brandoncarl commented 8 years ago

Hello – thanks very much for the great repository. I'm running into a peculiar error that doesn't seem to affect other parsers:

Namely, the following sentence parses What correctly as a det:

What theaters hold 200 people?

Meanwhile, if we change the case, what is parsed incorrectly as a dobj:

what theaters hold 200 people?

Cheers

brandoncarl commented 8 years ago

As an additional note, it appears that this always parses incorrectly in v0.101, but works selectively on the displaCy...

honnibal commented 8 years ago

Thanks for this.

Previous models used a crude form of data augmentation --- we would randomly lower-case documents, replace punctuation, etc. This had only a very small positive effect on accuracy on our formal evaluations, because the evaluation data tends to be quite well cased. I always guessed it might help on less well formed text, but I never evaluated this properly.

I'll switch the data augmentation back on for the next model uploaded, and try to get a better evaluation.
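The kind of augmentation described above can be sketched in a few lines of plain Python. This is a hypothetical helper for illustration only, not spaCy's actual training code; the function name and parameters are invented:

```python
import random

def augment_casing(sentences, lowercase_prob=0.5, seed=0):
    """Randomly lower-case training sentences so the model sees both
    well-cased and poorly-cased variants of the same data.

    Sketch of the augmentation strategy described above; not spaCy's
    real implementation.
    """
    rng = random.Random(seed)
    augmented = []
    for sent in sentences:
        augmented.append(sent)              # always keep the original
        if rng.random() < lowercase_prob:   # sometimes add a lowered copy
            augmented.append(sent.lower())
    return augmented

train = ["What theaters hold 200 people?"]
print(augment_casing(train, lowercase_prob=1.0))
# With lowercase_prob=1.0 every sentence also gets a lower-cased copy.
```

The point is that the tagger and parser then learn that "what" and "What" should usually receive the same analysis, at a small cost in accuracy on well-edited text.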

brandoncarl commented 8 years ago

Of course, and thanks again for the work!

If I manually add training examples to the tagger, does this augment rather than overwrite the model? I.e., I'd like to overcome this problem by augmenting the data.

honnibal commented 8 years ago

It should augment it, yes. But be aware that it might still not do quite what you expect. Imagine you do a bunch of updates on your few labelled examples. As you make these updates, the model isn't constrained to remain accurate on other text at the same time. So, you might drift off a lot.

One solution is to tag a lot of text, and mix your annotations in. This is like setting the objective, "I want a model that behaves as it used to, except that it tags these other things right."
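That mixing strategy can be sketched as follows. Again this is a hypothetical helper, not part of spaCy's API: it blends a few new gold-standard examples with many examples tagged by the existing model ("silver" annotations), so that updates don't pull the model away from its previous behaviour:

```python
import random

def mix_annotations(gold_new, silver_old, ratio=4, seed=0):
    """Blend new gold examples with model-tagged ("silver") examples
    from the old distribution, at roughly `ratio` silver examples per
    gold one, then shuffle for training.

    Illustrative sketch of the strategy described above.
    """
    rng = random.Random(seed)
    mixed = list(gold_new) + list(silver_old)[: ratio * len(gold_new)]
    rng.shuffle(mixed)
    return mixed

gold = ["What theaters hold 200 people?"]
silver = ["old text 1", "old text 2", "old text 3", "old text 4", "old text 5"]
print(mix_annotations(gold, silver, ratio=4))
```

The silver examples act as a regularizer: the model is rewarded for keeping its old predictions wherever the new annotations don't say otherwise.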

Another useful way to adapt the model is to make sure the Brown cluster features are being set correctly. If you have a word that's not known to the tagger, you can give it a big clue by setting its Brown cluster to match a word that behaves similarly.

In fact, the tagger responds to a very, very small set of features, so one way you might be able to manipulate the model is by manipulating how those features are calculated. For instance, in the example below, I manually fiddle with the features until "sanders" gets tagged as an NNP.

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u"some entities like sanders will always be difficult.")
>>> print([(w.text, w.tag_) for w in doc])
[(u'some', u'DT'), (u'entities', u'NNS'), (u'like', u'IN'), (u'sanders', u'NNS'), (u'will', u'MD'), (u'always', u'RB'), (u'be', u'VB'), (u'difficult', u'JJ'), (u'.', u'.')]
>>> doc = nlp(u"some entities like Bush will always be difficult.")
>>> print([(w.text, w.tag_) for w in doc])
[(u'some', u'DT'), (u'entities', u'NNS'), (u'like', u'IN'), (u'Bush', u'NNP'), (u'will', u'MD'), (u'always', u'RB'), (u'be', u'VB'), (u'difficult', u'JJ'), (u'.', u'.')]
>>> sanders = nlp.vocab[u'sanders']
>>> sanders.cluster
461
>>> bush = nlp.vocab[u'Bush']
>>> bush.cluster
22
>>> sanders.cluster = bush.cluster
>>> doc = nlp(u"some entities like sanders will always be difficult.")
>>> print([(w.text, w.tag_) for w in doc])
[(u'some', u'DT'), (u'entities', u'NNS'), (u'like', u'IN'), (u'sanders', u'NNS'), (u'will', u'MD'), (u'always', u'RB'), (u'be', u'VB'), (u'difficult', u'JJ'), (u'.', u'.')]
>>> sanders.is_title = True
>>> print([(w.text, w.tag_) for w in doc])
[(u'some', u'DT'), (u'entities', u'NNS'), (u'like', u'IN'), (u'sanders', u'NNS'), (u'will', u'MD'), (u'always', u'RB'), (u'be', u'VB'), (u'difficult', u'JJ'), (u'.', u'.')]
>>> sanders.suffix_ = u'ush'
>>> print([(w.text, w.tag_) for w in doc])
[(u'some', u'DT'), (u'entities', u'NNS'), (u'like', u'IN'), (u'sanders', u'NNS'), (u'will', u'MD'), (u'always', u'RB'), (u'be', u'VB'), (u'difficult', u'JJ'), (u'.', u'.')]
>>> sanders.prefix_ = u'B'
>>> doc = nlp(u"some entities like sanders will always be difficult.")
>>> print([(w.text, w.tag_) for w in doc])
[(u'some', u'DT'), (u'entities', u'NNS'), (u'like', u'IN'), (u'sanders', u'NNP'), (u'will', u'MD'), (u'always', u'RB'), (u'be', u'VB'), (u'difficult', u'JJ'), (u'.', u'.')]
>>> sanders.prefix_ = u's'
>>> doc = nlp(u"some entities like sanders will always be difficult.")
>>> print([(w.text, w.tag_) for w in doc])
[(u'some', u'DT'), (u'entities', u'NNS'), (u'like', u'IN'), (u'sanders', u'NN'), (u'will', u'MD'), (u'always', u'RB'), (u'be', u'VB'), (u'difficult', u'JJ'), (u'.', u'.')]

The line sanders = nlp.vocab[u'sanders'] is fetching a Lexeme object. Writes to this object are saved in the vocab, and will provide the source for all features for tokens of that type. So if you write different feature values into the vocab, you'll get different results from the statistical model.

This is a somewhat roundabout way to go about things, but it might be helpful to you.

brandoncarl commented 8 years ago

@honnibal That is very helpful, thank you.

It seems that a large augmented training corpus is probably the best route forward. Two quick questions (and then I'll close the issue):

  1. Is the training data available publicly? (for me to augment)
  2. How can I re-run the benchmarks myself? (to know how much my training impacted broader performance)

Thanks!

honnibal commented 8 years ago

You can obtain the training data via the Linguistic Data Consortium --- it's the OntoNotes 5 corpus. However, a commercial license will cost $25,000.

brandoncarl commented 8 years ago

Ha, no problem. I'll just write a check from my personal...nope, nevermind! Thanks for the response!

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.