TakeLab / spacy-udpipe

spaCy + UDPipe
MIT License

Error 190 with pt-bosque model #15

Closed rahonalab closed 4 years ago

rahonalab commented 4 years ago

Hi, I'm not sure whether this belongs here or whether I should contact the pt-bosque model developers, but I get the following error when using the pt-bosque model, i.e. spacy_udpipe.load("pt-bosque"):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 431, in __call__
    doc = self.make_doc(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 457, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.7/site-packages/spacy_udpipe/language.py", line 232, in __call__
    raise e
  File "/usr/local/lib/python3.7/site-packages/spacy_udpipe/language.py", line 220, in __call__
    spaces=spaces).from_array(attrs, array)
  File "doc.pyx", line 814, in spacy.tokens.doc.Doc.from_array
ValueError: [E190] Token head out of range in Doc.from_array() for token index '14' with value '27' (equivalent to relative head index: '27'). The head indices should be relative to the current token index rather than absolute indices in the array.

while analyzing the text:

text = "– Não sei. Harry olhou desesperado para os lados. Black e Lupin, os dois tinham se ido... não havia mais nenhum adulto em sua companhia exceto Snape, que ainda flutuava, inconsciente, no ar."

The error is not raised when using the 'default' Portuguese model, pt-gsd, which is loaded as 'pt', or when using spaCy with its own Portuguese model.

Thank you in advance!
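For readers unfamiliar with E190: the message says that head values passed to Doc.from_array must be offsets relative to each token, not absolute positions. The following is a minimal pure-Python sketch of that convention, not the spacy-udpipe or spaCy implementation. The function names to_relative_heads and check_relative_heads are hypothetical, and the sketch assumes a root token is represented as pointing at itself (offset 0).

```python
from typing import List


def to_relative_heads(absolute_heads: List[int]) -> List[int]:
    """Convert 0-based absolute head indices to offsets relative to each token.

    Assumption: a root token points at itself, so its offset is 0.
    """
    n = len(absolute_heads)
    relative = []
    for i, head in enumerate(absolute_heads):
        if not 0 <= head < n:
            raise ValueError(f"head {head} of token {i} is outside the document")
        relative.append(head - i)
    return relative


def check_relative_heads(relative_heads: List[int]) -> None:
    """Raise ValueError if any offset points outside the document."""
    n = len(relative_heads)
    for i, offset in enumerate(relative_heads):
        if not 0 <= i + offset < n:
            raise ValueError(
                f"token {i} has relative head {offset}, which points at "
                f"position {i + offset} in a {n}-token document"
            )


# Tokens "em o ar ." where "ar" (index 2) is the root and every other
# token attaches to it.
absolute = [2, 2, 2, 2]
relative = to_relative_heads(absolute)
check_relative_heads(relative)
print(relative)  # [2, 1, 0, -1]
```

Under this convention an offset of 27 on token 14 would point at position 41, which can only be valid in a document with at least 42 tokens.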

asajatovic commented 4 years ago

@rahonalab has this started to happen just in the latest spacy-udpipe version? I am asking because of the related issue #14

rahonalab commented 4 years ago

Honestly, I don't know, because I only started analysing Portuguese with the latest spacy-udpipe version...

KoichiYasuoka commented 4 years ago

>>> import spacy_udpipe
>>> nlp=spacy_udpipe.load("pt-bosque")
>>> doc=nlp("no ar.")
>>> print(doc)
em o ar

It seems "no" (= "em o") cannot be handled correctly, and the period has gone away. Umm... Well @asajatovic, how do you handle multiword tokens?

# text = no ar.
1-2 no  _   _   _   _   _   _   _   _
1   em  em  ADP _   _   3   case    _   _
2   o   o   DET _   _   3   det _   _
3   ar  ar  NOUN    _   _   0   root    _   SpaceAfter=No
4   .   .   PUNCT   _   _   3   punct   _   SpaceAfter=No
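To make the structure of the CoNLL-U output above easier to see, here is a small sketch that separates multiword-token range lines (IDs such as 1-2) from ordinary word lines and rebuilds the surface forms. It is an illustration in plain Python, not how spacy-udpipe reads UDPipe output. The names split_conllu and surface_forms are hypothetical, and the sample assumes tab-separated columns as in the CoNLL-U format.

```python
from typing import List, Tuple

Range = Tuple[int, int, str]      # (first word ID, last word ID, surface form)
Word = Tuple[int, str, int, str]  # (word ID, form, head ID, dependency relation)

SAMPLE = "\n".join([
    "# text = no ar.",
    "1-2\tno\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tem\tem\tADP\t_\t_\t3\tcase\t_\t_",
    "2\to\to\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tar\tar\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\tSpaceAfter=No",
])


def split_conllu(block: str) -> Tuple[List[Range], List[Word]]:
    """Split a CoNLL-U sentence into multiword-token ranges and syntactic words."""
    ranges: List[Range] = []
    words: List[Word] = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if "-" in cols[0]:
            start, end = (int(part) for part in cols[0].split("-"))
            ranges.append((start, end, cols[1]))
        elif "." in cols[0]:
            continue  # empty nodes (decimal IDs) are ignored in this sketch
        else:
            words.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return ranges, words


def surface_forms(ranges: List[Range], words: List[Word]) -> List[str]:
    """Rebuild the surface tokens, using the range form for covered words."""
    starts = {start: (end, form) for start, end, form in ranges}
    forms: List[str] = []
    skip_until = 0
    for word_id, form, _head, _rel in words:
        if word_id <= skip_until:
            continue  # this word is already covered by a multiword token
        if word_id in starts:
            skip_until, range_form = starts[word_id]
            forms.append(range_form)
        else:
            forms.append(form)
    return forms


ranges, words = split_conllu(SAMPLE)
print([form for _id, form, _head, _rel in words])  # syntactic words
print(surface_forms(ranges, words))                # surface tokens
```

The sample has four syntactic words (em, o, ar, .) but only three surface tokens (no, ar, .), and the HEAD column refers to word IDs, so any code that mixes the two numberings can end up with heads that point at the wrong position.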

KoichiYasuoka commented 4 years ago

But I'm not sure whether PR #17 works well for other languages...

asajatovic commented 4 years ago

@KoichiYasuoka multiword tokens are treated as single word tokens. Far from an ideal solution, but no issues were raised until now.

asajatovic commented 4 years ago

Fixed in #17