(CS model v2.5) - strange problem with "aby" word

fukidzon commented 4 years ago

I used spacy-udpipe to lemmatize all words from a word2vec model, but it crashed on some one-word texts. Interestingly, only 3 words out of 2 million were problematic: ['aby', 'Aby', 'ABY']. See the code and error:

import spacy_udpipe
spacy_udpipe.download('cs')
# the downloaded file is czech-pdt-ud-2.5-191206.udpipe
nlp = spacy_udpipe.load('cs')
word = 'aby'
doc = nlp(word)

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-d64aeb0566d4> in <module>
      4 nlp = spacy_udpipe.load('cs')
      5 word = 'aby'
----> 6 doc = nlp(word)

~/kajo/kajo_semexp_core/lib/python3.6/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    429                 Errors.E088.format(length=len(text), max_length=self.max_length)
    430             )
--> 431         doc = self.make_doc(text)
    432         if component_cfg is None:
    433             component_cfg = {}

~/kajo/kajo_semexp_core/lib/python3.6/site-packages/spacy/language.py in make_doc(self, text)
    455 
    456     def make_doc(self, text):
--> 457         return self.tokenizer(text)
    458 
    459     def _format_docs_and_golds(self, docs, golds):

~/kajo/kajo_semexp_core/lib/python3.6/site-packages/spacy_udpipe/language.py in __call__(self, text)
    230                 )
    231             else:
--> 232                 raise e
    233         # Overwrite lemmas separately to prevent overwritting by spaCy
    234         lemma_array = numpy.array(

~/kajo/kajo_semexp_core/lib/python3.6/site-packages/spacy_udpipe/language.py in __call__(self, text)
    218             doc = Doc(self.vocab,
    219                       words=words,
--> 220                       spaces=spaces).from_array(attrs, array)
    221         except ValueError as e:
    222             if '[E167]' in str(e):

doc.pyx in spacy.tokens.doc.Doc.from_array()

ValueError: [E190] Token head out of range in `Doc.from_array()` for token index '0' with value '1' (equivalent to relative head index: '1'). The head indices should be relative to the current token index rather than absolute indices in the array.

I'm using Python 3.6.9, spacy==2.2.4, spacy-udpipe==0.2.0

It worked OK with czech-pdt-ud-2.4-190531.udpipe, spacy==2.2.3, spacy-udpipe==0.1.0. It also works without error with the newer versions of spacy and spacy-udpipe when I load the older model manually as nlp = spacy_udpipe.load_from_path(lang="cs", path="<path_to>/czech-pdt-ud-2.4-190531.udpipe", meta={"description": "CS model"}), so this strange issue seems to be in the model czech-pdt-ud-2.5-191206.udpipe
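
Until the underlying problem is resolved, a bulk run over a large vocabulary could be kept alive by catching the error per word and collecting the failures for later inspection. The following is only a minimal sketch: `lemmatize_vocab` is a hypothetical helper (not part of spacy-udpipe), and `nlp` is assumed to be any callable that returns tokens with a `lemma_` attribute, such as the pipeline returned by `spacy_udpipe.load('cs')`.

```python
def lemmatize_vocab(words, nlp):
    """Lemmatize one-word texts, collecting inputs that raise ValueError.

    `nlp` is assumed to be a callable returning an iterable of tokens,
    each with a `lemma_` attribute (for example a spaCy pipeline).
    """
    lemmas = {}   # word -> list of lemmas, one per token produced
    failed = []   # (word, error message) for inputs that crashed
    for word in words:
        try:
            doc = nlp(word)
        except ValueError as err:
            # E190 from Doc.from_array() is raised as a ValueError
            failed.append((word, str(err)))
            continue
        lemmas[word] = [token.lemma_ for token in doc]
    return lemmas, failed


# Hypothetical usage with the pipeline from this report:
#   import spacy_udpipe
#   nlp = spacy_udpipe.load('cs')
#   lemmas, failed = lemmatize_vocab(['pes', 'aby', 'Aby', 'ABY'], nlp)
```

With the setup described above, `failed` would presumably contain the three "aby" variants, while the rest of the vocabulary is processed normally.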

KoichiYasuoka commented 4 years ago

I've found that czech-pdt-ud-2.5-191206.udpipe splits the single word "aby" into two words "aby by". It seems to be the same error as #15. I thought PR #17 would fix it, but PR #17 did not pass some checks. What do you think, @asajatovic?

>>> import spacy_udpipe
>>> nlp=spacy_udpipe.load("cs")
>>> model=nlp.tokenizer.model
>>> raw_udpipe=lambda t:model.write(model(t),"conllu")
>>> doc=raw_udpipe("aby")
>>> print(doc)
# newdoc
# newpar
# sent_id = 1
# text = aby
1-2     aby     _       _       _       _       _       _       _       SpaceAfter=No
1       aby     aby     SCONJ   J,------------- _       2       mark    _      _
2       by      být     AUX     Vc------------- Mood=Cnd|Person=3|VerbForm=Fin 0    root    _       _
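
To make the connection to the E190 message explicit, here is a small self-contained sketch that reads the CoNLL-U output above, finds the multiword token range (`1-2`), and converts the absolute HEAD column into the relative offsets that `Doc.from_array()` expects. The helper names (`parse_conllu`, `relative_heads`, `out_of_range`) are hypothetical and only for illustration. Whether spacy-udpipe 0.2.0 really ended up with a one-token doc here is an assumption, not something confirmed in this thread; the sketch only shows that such a mismatch would produce exactly "token index '0' with value '1'".

```python
# Raw UDPipe output for "aby", copied from the session above (tab-separated).
CONLLU = """\
# newdoc
# newpar
# sent_id = 1
# text = aby
1-2\taby\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No
1\taby\taby\tSCONJ\tJ,-------------\t_\t2\tmark\t_\t_
2\tby\tbýt\tAUX\tVc-------------\tMood=Cnd|Person=3|VerbForm=Fin\t0\troot\t_\t_
"""


def parse_conllu(text):
    """Return (multiword_ranges, words) from a single-sentence CoNLL-U string."""
    ranges, words = [], []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        tok_id = cols[0]
        if "-" in tok_id:
            # Multiword token line, e.g. "1-2": surface form spanning words 1..2
            start, end = map(int, tok_id.split("-"))
            ranges.append((start, end, cols[1]))
        elif "." in tok_id:
            continue  # empty nodes are not syntactic words
        else:
            # (1-based id, form, absolute 1-based head; 0 means root)
            words.append((int(tok_id), cols[1], int(cols[6])))
    return ranges, words


def relative_heads(words):
    """Convert absolute CoNLL-U heads to offsets relative to each token.

    The root (head 0) gets offset 0, since in spaCy the root is its own head.
    """
    return [0 if head == 0 else head - idx for idx, _, head in words]


def out_of_range(offsets, n_tokens):
    """Indices of tokens whose head would fall outside a doc of n_tokens."""
    return [i for i, off in enumerate(offsets[:n_tokens])
            if not 0 <= i + off < n_tokens]


ranges, words = parse_conllu(CONLLU)
offsets = relative_heads(words)
print(ranges)                    # multiword token ranges
print(offsets)                   # relative head offsets per syntactic word
print(out_of_range(offsets, 2))  # doc built from both words
print(out_of_range(offsets, 1))  # doc built from the surface token only (assumed)
```

With two tokens every head stays inside the doc, but if only the single surface token "aby" is kept, token 0 still carries the relative head 1, which points past the end of the doc.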

asajatovic commented 4 years ago

Fixed in #17

M4hakala commented 2 years ago

It looks like the fix is not working for this "aby by" issue. I'm using the latest version and I'm still struggling with it. @asajatovic, could you please take a look at it?

import spacy_udpipe
nlp = spacy_udpipe.load('cs')
print([t for t in nlp('aby')])

[aby, by]

TakeLab / spacy-udpipe

(CS model v2.5) - strange problem with "aby" word #14