Bug with lemmatisation of Slovene words with hyphens

clarinsi / classla

CLASSLA Fork of the Official Stanford NLP Python Library for Many Human Languages

https://www.clarin.si/info/k-centre/


Bug with lemmatisation of Slovene words with hyphens #21

Closed by TomazErjavec 2 years ago

TomazErjavec commented 3 years ago

As also described in the ParlaMint issues, most words that contain hyphens are lemmatised wrongly: at the very least the hyphen is replaced by some letter, and the rest of the word is often mangled as well, e.g. 500-metrskimi /5t0emetrski, dnevno-varstveni /dnevnosvarstven, le-tega /lest, prostorsko-gradbeno /prostorskongradben, 27,5-odstotno /2vzfsodstoten, 80-odstotna /c0vodstoten.

This bug seems to affect only Slovene: Croatian is fine and, as far as I could tell, so is Bulgarian.
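
For anyone who wants to scan their own output for this symptom, here is a minimal sketch of a check over (word form, lemma) pairs. It is not part of classla: `looks_mangled` is a hypothetical helper, and it rests on two assumptions, namely that a correct lemma of a hyphenated form keeps the hyphen and that any digits in the form reappear unchanged in the lemma. The pairs are the ones quoted above.

```python
# (word form, lemma) pairs quoted in the report above.
REPORTED = [
    ("500-metrskimi", "5t0emetrski"),
    ("dnevno-varstveni", "dnevnosvarstven"),
    ("le-tega", "lest"),
    ("prostorsko-gradbeno", "prostorskongradben"),
    ("27,5-odstotno", "2vzfsodstoten"),
    ("80-odstotna", "c0vodstoten"),
]


def digits_of(text: str) -> str:
    """Return only the digit characters of text, in order."""
    return "".join(ch for ch in text if ch.isdigit())


def looks_mangled(form: str, lemma: str) -> bool:
    """Hypothetical heuristic, not a classla function.

    Assumes a correct lemma keeps the hyphen of a hyphenated form
    and leaves its digits unchanged.
    """
    lost_hyphen = "-" in form and "-" not in lemma
    changed_digits = digits_of(form) != digits_of(lemma)
    return lost_hyphen or changed_digits


# Collect every pair that the heuristic flags as suspicious.
suspicious = [(form, lemma) for form, lemma in REPORTED if looks_mangled(form, lemma)]

for form, lemma in suspicious:
    print(f"{form} -> {lemma}")
```

The same function can be run over the form and lemma columns of any lemmatiser output; it only flags candidates for manual inspection and says nothing about whether an unflagged lemma is correct.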

nljubesi commented 3 years ago

I gave the reasons for this bug on the ParlaMint side. If anyone thinks that something constructive could be done on the side of classla, please speak up. To me, this is a training data issue.

nljubesi commented 2 years ago

This is no longer an issue in classla 1.1.0.