TomazErjavec closed this issue 2 years ago
I gave the reasons for this bug on the ParlaMint side. If anyone thinks that something constructive could be done on the side of classla, please speak up. To me, this is a training data issue.
This is not an issue any more in classla 1.1.0.
As also described in the ParlaMint issues, most words that contain a hyphen are lemmatised wrongly: at the very least the hyphen is replaced by some letter, and the rest of the word is often mangled as well, e.g. 500-metrskimi /5t0emetrski, dnevno-varstveni /dnevnosvarstven, le-tega /lest, prostorsko-gradbeno /prostorskongradben, 27,5-odstotno /2vzfsodstoten, 80-odstotna /c0vodstoten.
This bug seems to affect only Slovene; Croatian, for example, is OK. As far as I could tell, Bulgarian is also OK.
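
For anyone who wants to scan their own output for this pattern, here is a minimal sketch of a check that flags form/lemma pairs where the hyphen went missing in the lemma. It is not part of classla: the helper names `hyphen_dropped` and `suspicious_pairs` are made up for illustration, and the heuristic (a hyphen in the form should survive in the lemma) is an assumption that may produce false positives. The pairs are the ones listed above.

```python
# Form/lemma pairs copied from the examples in this report.
REPORTED = [
    ("500-metrskimi", "5t0emetrski"),
    ("dnevno-varstveni", "dnevnosvarstven"),
    ("le-tega", "lest"),
    ("prostorsko-gradbeno", "prostorskongradben"),
    ("27,5-odstotno", "2vzfsodstoten"),
    ("80-odstotna", "c0vodstoten"),
]


def hyphen_dropped(form: str, lemma: str) -> bool:
    """Hypothetical heuristic: True if the form has more hyphens than the lemma."""
    return form.count("-") > lemma.count("-")


def suspicious_pairs(pairs):
    """Return the (form, lemma) pairs flagged by hyphen_dropped."""
    return [(form, lemma) for form, lemma in pairs if hyphen_dropped(form, lemma)]


if __name__ == "__main__":
    for form, lemma in suspicious_pairs(REPORTED):
        print(f"{form} -> {lemma}")
```

The same function can be run over (form, lemma) pairs taken from any lemmatiser output to get a rough count of affected tokens before and after upgrading.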