Closed fukidzon closed 4 years ago
I've found that czech-pdt-ud-2.5-191206.udpipe
splits the single word "aby" into the two words "aby by". It seems to be the same error as #15. I thought PR #17 would fix it, but PR #17 did not pass some checks. What do you think, @asajatovic?
>>> import spacy_udpipe
>>> nlp = spacy_udpipe.load("cs")
>>> model = nlp.tokenizer.model
>>> raw_udpipe = lambda t: model.write(model(t), "conllu")
>>> doc = raw_udpipe("aby")
>>> print(doc)
# newdoc
# newpar
# sent_id = 1
# text = aby
1-2 aby _ _ _ _ _ _ _ SpaceAfter=No
1 aby aby SCONJ J,------------- _ 2 mark _ _
2 by být AUX Vc------------- Mood=Cnd|Person=3|VerbForm=Fin 0 root _ _
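For context, the `1-2 aby` line in the output above is a CoNLL-U multi-word token range: UDPipe treats the surface token "aby" as two syntactic words, "aby" + "by" (a standard UD analysis for Czech conditional conjunctions), which is what breaks the 1:1 alignment spacy-udpipe expects. A minimal sketch of how such range lines can be detected in raw CoNLL-U output (hypothetical helper, not part of spacy-udpipe):

```python
def find_mwt_ranges(conllu: str):
    """Return (start, end, surface_form) for each multi-word token range line
    (lines whose ID field looks like '1-2') in a CoNLL-U string."""
    ranges = []
    for line in conllu.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.split("\t")
        if "-" in fields[0]:  # e.g. '1-2' marks a multi-word token
            start, end = fields[0].split("-")
            ranges.append((int(start), int(end), fields[1]))
    return ranges

# Simplified version of the CoNLL-U output shown above
sample = "\n".join([
    "# sent_id = 1",
    "# text = aby",
    "1-2\taby\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "1\taby\taby\tSCONJ\t_\t_\t2\tmark\t_\t_",
    "2\tby\tbýt\tAUX\t_\tMood=Cnd|Person=3|VerbForm=Fin\t0\troot\t_\t_",
])
print(find_mwt_ranges(sample))  # → [(1, 2, 'aby')]
```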
Fixed in #17
It looks like the fix is not working for this "aby by" issue. I'm using the latest version and I'm still struggling with it. @asajatovic, could you please take a look at it?
import spacy_udpipe
nlp = spacy_udpipe.load('cs')
print([t for t in nlp('aby')])
[aby, by]
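If the goal is one token per surface word, the multi-word token expansion could in principle be undone by collapsing the syntactic words back into their surface form using the range lines. A hypothetical, library-free sketch of that idea (the helper name and data shapes are assumptions, not spacy-udpipe API):

```python
def collapse_syntactic_words(tokens, mwt_ranges):
    """Collapse UD syntactic words back into surface tokens.

    tokens:     list of (word_id, form) pairs from a CoNLL-U sentence
    mwt_ranges: list of (start, end, surface_form) multi-word token ranges
    """
    covered = set()
    for start, end, _ in mwt_ranges:
        covered.update(range(start, end + 1))
    surface_at = {start: surface for start, _, surface in mwt_ranges}

    out = []
    for word_id, form in tokens:
        if word_id in surface_at:
            out.append(surface_at[word_id])   # emit the surface form once
        elif word_id not in covered:
            out.append(form)                  # plain token, keep as-is
        # words inside a range (other than its start) are dropped
    return out

# 'aby' expanded to 'aby' + 'by' collapses back to the single surface token
print(collapse_syntactic_words([(1, "aby"), (2, "by")], [(1, 2, "aby")]))  # → ['aby']
```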
I used spacy-udpipe to lemmatize all words from a word2vec model, but it crashed on some one-word texts. Interestingly, only 3 words out of 2 million were problematic: ['aby', 'Aby', 'ABY'] ...See the code and error:
Error:
I'm using Python 3.6.9, spacy==2.2.4, spacy-udpipe==0.2.0
It worked OK with czech-pdt-ud-2.4-190531.udpipe, spacy==2.2.3, spacy-udpipe==0.1.0, and it also works without error with the newer versions of spacy and spacy-udpipe when I manually load the model as
nlp = spacy_udpipe.load_from_path(lang="cs", path="<path_to>/czech-pdt-ud-2.4-190531.udpipe", meta={"description": "CS model"})
- so this strange issue seems to be in the model czech-pdt-ud-2.5-191206.udpipe