explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Assertion error in makedoc if spacy tokenizer is used in stanza and text contains newline #43

Closed: jsalbr closed this issue 3 years ago

jsalbr commented 4 years ago

Hi Ines, sorry, but I ran into a small bug. I could at least track down the symptoms: the problem occurs when I use stanza with the spacy tokenizer and my text contains a newline. The obvious workaround is to collapse whitespace before processing: text = re.sub(r'\s+', ' ', text)

# spacy.__version__ # 2.3.0
# stanza.__version__ # 1.0.1
# spacy_stanza.__version__ # 0.2.3

text = "The FHLBB was insolvent and its\nassets were transferred. "
# works if \n in text is replaced by a space

import spacy, stanza, spacy_stanza
from spacy_stanza import StanzaLanguage

# stanza nlp works fine
stanza_nlp = stanza.Pipeline('en', processors={'tokenize': 'spacy'})
doc = stanza_nlp(text)

# spacy-stanza raises an AssertionError
spacy_stanza_nlp = StanzaLanguage(stanza_nlp)
doc = spacy_stanza_nlp.make_doc(text)

Here is the trace:

---------------------------------------------------------------------------
AssertionError                  Traceback (most recent call last)
---> 22 doc = spacy_stanza_nlp.make_doc(text)

.../spacy-stanza/spacy_stanza/language.py in make_doc(self, text)
     65         these will be mapped to token vectors.
     66         """
---> 67         doc = self.tokenizer(text)
     68         if self.svecs is not None:
     69             doc.user_token_hooks["vector"] = self.token_vector

.../spacy-stanza/spacy_stanza/language.py in __call__(self, text)
    193             else:
    194                 token = snlp_tokens[i + offset]
--> 195                 assert word == token.text
    196 
    197                 pos.append(self.vocab.strings.add(token.upos or ""))

AssertionError: 

adrianeboyd commented 4 years ago

Thanks for the report! That is indeed a bug in the updated alignment code, which mistakenly assumed that the stanza models would never return whitespace tokens (I wasn't aware of the extra spacy tokenizer option). The fix in #44 should address this, and it will be in the next release (v0.2.4). If you want to install from source in the meantime (wait until after the PR is merged!), you can run:

pip install https://github.com/explosion/spacy-stanza/archive/master.zip
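To illustrate the kind of assumption involved, here is a toy sketch of an alignment step that gives the same result whether or not the tokenizer returns whitespace as its own token. It is not the spacy-stanza code or the fix in #44; the function name fill_gaps and the shortened example text are made up for illustration, and single spaces are kept as separate segments purely to keep the sketch small.

```python
def fill_gaps(text, tokens):
    """Toy alignment: return segments that cover `text` completely.

    `tokens` are the tokenizer's token strings in order. Any whitespace the
    tokenizer did not return as a token is added as its own segment, so the
    result is the same with or without whitespace tokens in the input.
    """
    segments = []
    pos = 0
    for tok in tokens:
        # Raises ValueError if the token does not occur at or after `pos`.
        start = text.index(tok, pos)
        gap = text[pos:start]
        if gap:
            if not gap.isspace():
                # Tokens and text disagree on something other than whitespace.
                raise ValueError(f"unaligned text {gap!r} before token {tok!r}")
            segments.append(gap)
        segments.append(tok)
        pos = start + len(tok)
    if pos < len(text):
        # Remaining text after the last token (typically trailing whitespace).
        segments.append(text[pos:])
    return segments


text = "its\nassets were"
# Tokenizer that drops the newline vs. one that returns it as a token.
without_ws = fill_gaps(text, ["its", "assets", "were"])
with_ws = fill_gaps(text, ["its", "\n", "assets", "were"])
```

Both calls produce the same segments, and joining them reproduces the original text. An alignment that instead assumes whitespace never appears in the tokenizer output can drift by one position as soon as a "\n" token shows up, which is the kind of mismatch an assertion like the one in the trace would catch.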

Note that the stanza models can't really handle the whitespace token, though. For the newline, I get this token analysis:

{
  "id": "7",
  "text": "\n",
  "lemma": "\n",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Number=Sing",
  "head": 8,
  "deprel": "compound",
  "misc": "start_char=31|end_char=32"
},

So you might want to consider replacing extra whitespace with single spaces anyway, at least with the provided stanza models.
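A minimal sketch of that preprocessing step, using the same regular expression as the workaround in the original report. The helper name normalize_whitespace is hypothetical and not part of spaCy, stanza, or spacy-stanza.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text)


text = "The FHLBB was insolvent and its\nassets were transferred. "
cleaned = normalize_whitespace(text)
# `cleaned` can now be passed to the pipeline instead of `text`.
```

Keep in mind that collapsing longer whitespace runs shortens the text, so character offsets in the resulting analysis refer to the cleaned string rather than the original one.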

jsalbr commented 4 years ago

Thanks Adriane for fixing. The workaround is simple once you've found the reason ;-)