Invalid parse tree state

buhrmann commented 5 years ago

Hi, it seems that in some cases of using StanfordNLP models the result is an invalid parse tree state. When trying to merge certain spans, I get RuntimeError: [E039] Array bounds exceeded while searching for root word. This likely means the parse tree is in an invalid state.

Here is a reproducible example (at least for my installation), failing when trying to merge an emoji:

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

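# Download the Catalan models once (uncomment on first run):
# stanfordnlp.download('ca')
snlp = stanfordnlp.Pipeline(lang='ca')
# Wrap the StanfordNLP pipeline so that it returns spaCy Doc objects
ca = StanfordNLPLanguage(snlp)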

txt = "🙅🚫 Els comentaris i els gestos ofensius o els tocaments indesitjats són violència masclista.  💬 Si vius alguna d’aquestes situacions, denuncia-ho a @mossos i informa’ns a través de l’app #BCNantimasclista per prevenir-les. 💪 #JuntesSomMésFortes!  ℹ️ https://t.co/gOnPU9vgdt https://t.co/qtntxX97Ih"

doc = ca(txt)
print(list(doc))
# Merging this single-token span (an emoji) is the call that raises E039
doc[16:17].merge()

This doesn't seem to happen with a regular spaCy language (the tokenization is slightly different, but merging spans that include the same emoji works here):

import spacy
en = spacy.load('en')
doc = en(txt)
print(list(doc))
print(list(doc[16:18]))
doc[17:18].merge()
doc[16:18].merge()
print(list(doc))

ines commented 5 years ago

Thanks for the report! The error seems to occur when spaCy is trying to count the words to the root and fails.
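
To illustrate what "counting the words to the root" means, here is a minimal pure-Python sketch. It is a simplified model and not spaCy's actual code: the function name and the list-of-offsets representation are assumptions made for this example. It assumes each token stores its head as a relative offset, with 0 marking the root, and that the search gives up once it has taken as many steps as there are tokens.

```python
def count_words_to_root(heads, start, sent_length):
    """Follow relative head offsets from `start` until a root (offset 0) is found.

    `heads[i]` is the offset from token i to its head; 0 means token i is a root.
    Raises RuntimeError if no root is reached within `sent_length` steps.
    """
    i = start
    steps = 0
    while heads[i] != 0:
        i += heads[i]          # jump to the head of the current token
        steps += 1
        if steps >= sent_length:
            # A valid tree over sent_length tokens can never need this many steps
            raise RuntimeError("Array bounds exceeded while searching for root word.")
    return steps


valid_heads = [1, 1, 0, -1]   # 0 -> 1 -> 2 (root) <- 3
cyclic_heads = [1, -1, 0]     # tokens 0 and 1 point at each other, so no root is reachable

print(count_words_to_root(valid_heads, 0, len(valid_heads)))

try:
    count_words_to_root(cyclic_heads, 0, len(cyclic_heads))
except RuntimeError as err:
    print("invalid tree:", err)
```

In this model, the error only appears when the head offsets form a cycle or otherwise never lead to a root, which is why the message describes the parse tree as being in an invalid state.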

I tried it locally, but I haven't been able to reproduce the problem 🤔 I ran your exact code, with both spaCy v2.0.18 and spaCy v2.1.3. I also tried the new doc.retokenize context manager for comparison, and tried merging various combinations of tokens including the emoji.

buhrmann commented 5 years ago

Hmm, ok, I'll try to check if it's to do with the version then.

ines commented 5 years ago

Do you have anything else set up in your pipeline by any chance? Like, spacymoji etc.?

buhrmann commented 5 years ago

Hm, no, it failed with the exact code sample above, though now suddenly it seems to work! To be honest, I'm not sure whether any related package has been updated in my environment, though I'm pretty sure neither spaCy nor spacy_stanfordnlp has changed. My best guess is that the Stanford model itself has been updated on the servers, but really I don't know... In any case, I think this can be closed as not reproducible.
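
One way to rule out silent package updates in a case like this is to record the installed versions alongside each run. This is a minimal sketch using only the standard library (Python 3.8+). The helper name installed_versions is hypothetical, and the distribution names are assumed to match the PyPI names of the packages used above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return versions


print(installed_versions(["spacy", "stanfordnlp", "spacy-stanfordnlp"]))
```

This only covers Python packages. It would not detect a change in the downloaded model files themselves.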

explosion / spacy-stanza

Invalid parse tree state #11