clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

FI: significant Swedish text not marked as such #795

Open TomazErjavec opened 1 year ago

TomazErjavec commented 1 year ago

While doing the MT, Taja noticed that quite a lot of the text in the FI transcriptions is in fact Swedish but is not marked as such. This is especially bad for MT, which applies the Finnish model to everything marked as Finnish, including the Swedish text, so the Swedish text remains untranslated. A quick count (the Swedish words in the machine-translated corpus are analysed as unknown PoS, i.e. 'X') shows that this affects 1,743,576 tokens (7.6%). Obviously this can't be corrected for 4.0, so I am setting it to the Future milestone.
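For anyone wanting to reproduce a count like this, here is a minimal sketch. It is not the script used for the figure above. It assumes the annotated corpus is available as CoNLL-U lines with the UPOS tag in the fourth column; the function names `count_x_tokens` and `percent` are hypothetical.

```python
from typing import Iterable, Tuple


def count_x_tokens(conllu_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (tokens tagged X, all tokens) for an iterable of CoNLL-U lines.

    Assumption: standard CoNLL-U layout, i.e. tab-separated columns with
    ID in column 1 and UPOS in column 4.
    """
    x_count = 0
    total = 0
    for raw in conllu_lines:
        line = raw.rstrip("\n")
        # Skip sentence separators and comment lines such as "# text = ..."
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # not a well-formed token line
        tok_id = cols[0]
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1")
        if "-" in tok_id or "." in tok_id:
            continue
        total += 1
        if cols[3] == "X":
            x_count += 1
    return x_count, total


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole` as a percentage (0.0 if `whole` is 0)."""
    return 100.0 * part / whole if whole else 0.0
```

To use it, pass an open file object (or any iterable of lines) to `count_x_tokens` and feed the two results to `percent`. Note that 'X' is only a proxy here: it also covers other unanalysable material, so the share gives an upper estimate of the unmarked Swedish text.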