Closed by jsalbr 3 years ago
Thanks for the report! That is indeed a bug in the updated alignment code, which mistakenly assumed that the stanza models would never return whitespace tokens (I wasn't aware of the extra spacy tokenizer option). The fix in #44 should address this, and it will be in the next release (v0.2.4). If you want to install from source in the meantime (wait until after the PR is merged!), you can run:
pip install https://github.com/explosion/spacy-stanza/archive/master.zip
Note that the stanza models can't really handle the whitespace token, though. I get the token analysis:
{
  "id": "7",
  "text": "\n",
  "lemma": "\n",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Number=Sing",
  "head": 8,
  "deprel": "compound",
  "misc": "start_char=31|end_char=32"
},
So you might want to consider replacing extra whitespace with single spaces anyway, at least with the provided stanza models.
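A minimal sketch of that normalization, assuming plain Python and only the standard `re` module. The helper name `normalize_whitespace` and the sample text are made up for illustration and are not part of spacy-stanza; the trailing `.strip()` is an extra step beyond collapsing whitespace.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace (spaces, tabs, newlines) into one space.

    Hypothetical helper for illustration; not part of spacy-stanza or stanza.
    """
    # \s+ matches one or more whitespace characters, including "\n" and "\t".
    collapsed = re.sub(r"\s+", " ", text)
    # Extra step: drop leading/trailing whitespace so no stray token remains.
    return collapsed.strip()


sample = "This is a sentence.\nAnd   another\tone."
cleaned = normalize_whitespace(sample)
print(cleaned)
```

Keep in mind that the cleaned text has different character offsets than the original, so values like `start_char` and `end_char` refer to the normalized string, not the raw input.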
Thanks Adriane for fixing. The workaround is simple once you've found the reason ;-)
Hi Ines, sorry that I ran into a small bug. I could at least track down the symptoms. The problem occurs if I use stanza with the spacy tokenizer and my text contains a newline. The obvious workaround is this one:
import re
text = re.sub(r'\s+', ' ', text)
Here is the trace: