explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

More edge issues? #258

Closed jerryr56 closed 7 years ago

jerryr56 commented 8 years ago

Hello Matthew,

Thanks for the amazingly quick response on my last open ticket. I downloaded the fix today and installed it, and ran a bigger slug of data: all ~2600 paragraphs (700,000 words) from Josephus War & Antiquities. I think I'm still seeing a couple of glitches caused by this unusual syntax. Again there's no urgency about this from my end (workarounds are obvious) but thought you might like to know.

Example 1: in the 2nd sentence, the root is the verb 'was', at position 25 in the document. The minimum index in the subtree is position 24, while the left edge is at 23. This error depends on the inclusion of the first sentence! If the document consists of the second sentence alone, there's no problem.

In[12]: doc = enlp('He was certainly a very happy man, and afforded no occasion to have any complaint made of fortune on his account. He it was who alone had three of the most desirable things in the world,--the government of his nation, and the high priesthood, and the gift of prophecy.')
In[13]: doc[25]
Out[13]: was 
In[14]: root=doc[25]
In[15]: ilist=list(t.i for t in root.subtree)
In[16]: min(ilist)
Out[16]: 24
In[17]: root.left_edge.i
Out[17]: 23
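
The check above amounts to comparing the reported edge with the smallest (or largest) index found by walking the subtree. Here is a minimal pure-Python sketch of that comparison, assuming the parse is given as a plain list of head indices with the root pointing at itself. The names `subtree_indices` and `check_edges` and the toy `heads` list are hypothetical, made up for illustration, and are not part of spaCy.

```python
def subtree_indices(heads, root):
    """Return the sorted indices of `root` and all of its descendants.

    `heads[i]` is the index of token i's head; a sentence root is
    assumed to be its own head (heads[i] == i).
    """
    children = {}
    for i, h in enumerate(heads):
        if h != i:  # skip the root's self-loop
            children.setdefault(h, []).append(i)
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return sorted(found)


def check_edges(heads, root, left_edge, right_edge):
    """Compare reported edge indices with the subtree's true min and max."""
    idx = subtree_indices(heads, root)
    return {
        "min": idx[0],
        "max": idx[-1],
        "left_ok": left_edge == idx[0],
        "right_ok": right_edge == idx[-1],
    }


# Toy parse (assumption): token 1 is the root; 3 attaches to 4, 4 to 2.
heads = [1, 1, 1, 4, 2]
consistent = check_edges(heads, root=2, left_edge=2, right_edge=4)
# A left edge one position too far left, similar in shape to Example 1.
inconsistent = check_edges(heads, root=2, left_edge=1, right_edge=4)
```

With a real document, the same inputs could presumably be built as `heads = [t.head.i for t in doc]`, with `root.left_edge.i` and `root.right_edge.i` as the reported edges.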

Example 2 seems to result from the fact that I'm still using NLTK for my document pre-processing and joining the tokens back together before feeding them to spaCy. As a result, the word "Caesar's" below (for example) is separated into three tokens instead of just two. The root 'sat' at position 38 is not the root of the entire sentence, just the root of the 2nd clause. The right edge is way off for this input.

In[51]: doc=enlp("The presidents sat first, as Caesar ' s letters had appointed, who were Saturninus and Pedanius, and their lieutenants that were with them, with whom was the procurator Volumnius also; next to them sat the king ' s kinsmen and friends, with Salome also, and Pheroras; after whom sat the principal men of all Syria, excepting Archelaus; for Herod had a suspicion of him, because he was Alexander ' s father-in-law. ")
In[52]: root=doc[38]
In[53]: root
Out[53]: sat 
In[54]: ilist=list(t.i for t in root.subtree)
In[55]: max(ilist)
Out[55]: 41
In[56]: root.right_edge.i
Out[56]: 84
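
One possible workaround for the split possessives is to rejoin them before parsing. This is only a heuristic sketch, assuming the pre-processing always emits the pattern `word ' s` with single spaces around the apostrophe. The function name `rejoin_clitics` and the list of suffixes are assumptions for illustration and have not been checked against the full Josephus text.

```python
import re

# Matches a word character, then " ' ", then a common English clitic.
# The suffix list is an assumption; extend or trim it for the corpus at hand.
_CLITIC = re.compile(r"(\w) ' (s|t|d|m|ll|re|ve)\b")


def rejoin_clitics(text):
    """Collapse "Caesar ' s" back into "Caesar's" (heuristic only)."""
    return _CLITIC.sub(r"\1'\2", text)


sample = "as Caesar ' s letters had appointed; next to them sat the king ' s kinsmen"
cleaned = rejoin_clitics(sample)
```

The pattern only fires when the suffix is a complete word, so a spaced-out quotation such as `' so` is left alone. Plural possessives (a bare trailing apostrophe) are not handled.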

By the way, all my Python 3.5 stuff (including spaCy) was installed into a system directory, /Library/Frameworks/Python.framework..., under OS X 10.11.3 before I read your advice to avoid installing into system directories, and now I have no idea how I would go about installing anywhere else! But I don't think I'm seeing any problems as a result of this.

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.