adobe / NLP-Cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
http://opensource.adobe.com/NLP-Cube/index.html
Apache License 2.0

Greek model of nlpcube 0.3.1.0 separates punctuation into other words #130

Closed KoichiYasuoka closed 3 years ago

KoichiYasuoka commented 3 years ago

Describe the bug: The multiword segmentation of the Greek model sometimes expands punctuation into other words.

To Reproduce

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("el")
>>> doc=nlp("Δεν υπάρχει βασιλικός δρόμος.")
>>> print(doc)
1-2 Δεν _   _   _   _   _   _   _   _
1   σ   σε  ADP AsPpSp  _   2   case    _   _
2   το  ο   DET AtDf    Case=Acc|Gender=Neut|Number=Sing    3   obl _   _
3   υπάρχει υπάρχω  VERB    VERB    Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act  0   root    _   _
4   βασιλικός   βασιλικός   ADJ ADJ Case=Nom|Gender=Masc|Number=Sing    5   amod    _   _
5   δρόμος  δρόμος  NOUN    NOUN    Case=Nom|Gender=Masc|Number=Sing    3   nsubj   _   _
6-7 .   _   _   _   _   _   _   _   _
6   σ   σε  ADP AsPpSp  _   2   case    _   _
7   τον ο   DET AtDf    Case=Acc|Gender=Masc|Number=Sing    3   det _   _

Expected behavior: Neither "." nor "Δεν" should be treated as a multiword token.

tiberiu44 commented 3 years ago

This should be an easy fix. I will update the models soon.

tiberiu44 commented 3 years ago

This (along with some other tokenization issues) is now fixed with the latest model:

In [4]: cube("Δεν υπάρχει βασιλικός δρόμος.")
Out[4]:
1   Δεν δεν PART    PART    _   2   advmod  _   _
2   υπάρχει υπάρχω  VERB    VERB    Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act  0   root    _   _
3   βασιλικός   βασιλικός   ADJ ADJ Case=Nom|Gender=Masc|Number=Sing    4   amod    _   _
4   δρόμος  δρόμος  NOUN    NOUN    Case=Nom|Gender=Masc|Number=Sing    2   nsubj   _   _
5   .   .   PUNCT   PUNCT   _   2   punct   _   _

The dataset is relatively small, so there will still be many cases where tokenization fails.
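Since similar mis-tokenizations may still surface on other text, one quick way to screen output is to list every multiword-token range the model produces and check it by hand. The sketch below is not part of NLP-Cube; it only uses the API already shown in this thread (Cube, load("el"), and the CoNLL-U text that print(doc) displays, assumed here to be returned by str(doc) with tab-separated columns). The helper name multiword_ranges is hypothetical.

import re
from cube.api import Cube

def multiword_ranges(conllu: str):
    """Collect (ID range, surface form) pairs for multiword tokens,
    i.e. CoNLL-U lines whose first column looks like '1-2'.
    Hypothetical helper for manual inspection, not part of NLP-Cube."""
    ranges = []
    for line in conllu.splitlines():
        cols = line.split("\t")  # CoNLL-U columns are tab-separated
        if len(cols) >= 2 and re.fullmatch(r"\d+-\d+", cols[0]):
            ranges.append((cols[0], cols[1]))
    return ranges

nlp = Cube()
nlp.load("el")
doc = nlp("Δεν υπάρχει βασιλικός δρόμος.")
# With the old model this listed ('1-2', 'Δεν') and ('6-7', '.');
# with the fixed model this sentence should yield no multiword ranges at all.
print(multiword_ranges(str(doc)))

Genuine Greek multiword tokens are contractions such as στο or στον (σε + article), so any range whose surface form is bare punctuation, or a plain word like Δεν, is a sign of the tokenizer error reported above.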

KoichiYasuoka commented 3 years ago

I've confirmed that the Greek model now works well. Thank you @tiberiu44 for the quick response.