explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
29.78k stars 4.37k forks source link

Subtree doesn't agree with right edge? #251

Closed jerryr56 closed 8 years ago

jerryr56 commented 8 years ago

First of all, let me say that this is a fantastic package! After struggling with NLTK and Stanford Core NLP, I am so excited to have found this. Thank You Matthew! While I'm here, I might as well introduce myself: I'm hoping to use the parser to help evaluate a proposed set of literary parallels between the Gospel of Luke, and the histories of Josephus. More about the project is at:

http://postflaviana.org/introduction-flavian-origins-theory-christianity/

http://postflaviana.org/community/index.php?threads/flavian-signature-verification-project-progress-report.1588

My example text for this issue is from the first sentence in 'The Jewish Wars' by Flavius Josephus, Whiston translation. In this example, the subtree of the word 'for' is generated correctly by t.subtree ('for the sake of'), but t.right_edge incorrectly extends to the word 'as'.

This almost always works as I would expect. (That is, the right_edge corresponds with the last token in the subtree.) This particular piece of twisted 18th century syntax is the only counter-example I've stumbled across.

Of course the obvious work-around is to use the subtree expansion to find the right edge. However, I'm mentioning it here because the difference in this case seems to hint at some interesting underlying confusion in the parser, as to what to do with the 'such as' clause?

In the console:

from spacy import English
enlp=English()

doc=enlp('I have proposed to myself, for the sake of such as live under the government of the Romans, to translate those books into the Greek tongue.')

t=doc[6]
t
>> for 
list(t.subtree)
>> [for , the , sake , of ]
t.right_edge
>> as 
honnibal commented 8 years ago

Thanks for the report!

When analysing tricky sentences like this, the parser has the ability to 'change its mind', and modify the analysis it's previously built, on-the-fly. You can catch it in the act if you step-through the sentence on the visualizer, displaCy: http://spacy.io/displacy/

The bug occurred in the implementation of this type of transition. When the parser makes a left-arc that "clobbers" an existing arc, we have to delete the arc that was there before, and recalculate the rightward edge of the former parent. This was being done correctly, but we weren't also recalculating the rightward edges of all higher ancestors in the tree.

I have a preliminary fix, but I need to rerun some accuracy evaluations before releasing it. There's some chance this was masking another issue, since the rightward edges are used to calculate a lot of features.

honnibal commented 8 years ago

The fix for this should be out now, in v0.100.5. Please reopen if it doesn't work!

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.