clarinsi / classla

CLASSLA Fork of the Official Stanford NLP Python Library for Many Human Languages
https://www.clarin.si/info/k-centre/

Croatian, Serbian and Bulgarian standard models mistag quotes #5

Closed nljubesi closed 1 year ago

nljubesi commented 4 years ago

The sentence

Tako i srpski „trpi“ „okupaciju“ od strane engleskog jezika kao lingua franca.

alerted us to the fact that a good part of the models have not seen alternative quotation marks, neither in the training data nor in the embeddings. With the Serbian standard model, the quotes above are tagged as nouns and adjectives. With the non-standard model the tags are correct. The Bulgarian model tags one of the quotes correctly and tags the other as a residual. The Croatian standard model tags one of the quotes correctly, while the other is tagged as an adjective or a noun. The non-standard model, again, performs well. In Slovene the tags seem OK.

One potential approach might be to extend the tagger with a list of punctuation marks and enforce their tags. TBD.
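
A minimal sketch of what enforcing punctuation tags could look like as a post-processing step. This is not the classla API: `is_punctuation` and `enforce_punctuation` are hypothetical helpers, the tagger output is assumed to be a list of `(form, upos, xpos)` tuples, and the enforced tags (`PUNCT` for UPOS, `Z` for the MULTEXT-East XPOS) are assumptions about the tagset. Instead of a hand-written list, it relies on Unicode punctuation categories, which cover „ and “ as well.

```python
import unicodedata

# Assumed tag values: UD UPOS "PUNCT" and MULTEXT-East XPOS "Z".
PUNCT_UPOS = "PUNCT"
PUNCT_XPOS = "Z"


def is_punctuation(form: str) -> bool:
    """Return True if every character is in a Unicode punctuation category (P*)."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def enforce_punctuation(tagged):
    """Override the tags of punctuation-only tokens.

    `tagged` is a list of (form, upos, xpos) tuples (a hypothetical
    representation of tagger output); a corrected copy is returned.
    """
    fixed = []
    for form, upos, xpos in tagged:
        if is_punctuation(form):
            upos, xpos = PUNCT_UPOS, PUNCT_XPOS
        fixed.append((form, upos, xpos))
    return fixed


# Hypothetical mistagged output for a fragment of the sentence above;
# the wrong tags are illustrative, not actual model output.
tagged = [
    ("„", "NOUN", "Ncmsn"),
    ("trpi", "VERB", "Vmr3s"),
    ("“", "ADJ", "Agpmsny"),
    (".", "PUNCT", "Z"),
]
fixed = enforce_punctuation(tagged)
```

Such an override only touches tokens made up entirely of punctuation characters, so ordinary words keep whatever tags the model assigned.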

vukbatanovic commented 4 years ago

I'm not sure the Slovene model is entirely free of errors of this sort. For instance, in the following example the first quote is tagged as Xf by the non-standard model: Ja kak pa si naj jas zaj tou tolmačin? Je touti vaš „conditio sine qua non“ en nouvi müster za “funkcioniranja” v najbuj modično pokakani Prleččini al kaj?

lkrsnik commented 1 year ago

Such strings are now tagged by the tokenizers (reldi and obeliks).