Helsinki-NLP / OPUS

The Open Parallel Corpus

problem with Georgian data in OpenSubtitles #16

Open jorgtied opened 4 months ago

jorgtied commented 4 months ago

https://opus.nlpl.eu/OpenSubtitles/en&ka/v2018/OpenSubtitles

Almost every data point is damaged. The Georgian side is nonsense. When I searched for these lines on the OpenSubtitles site, I found that they are just Russian characters mapped onto the Georgian alphabet. Many multilingual models are now poisoned by this data. It would be great to investigate this further.
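As a possible starting point for that investigation, here is a rough sketch of a filter that flags Georgian lines whose vowel share looks unusual. It assumes that text produced by mapping Russian characters onto Georgian letters has a different vowel distribution than real Georgian, which I have not verified against this corpus. The function names (`vowel_ratio`, `looks_suspicious`) and the thresholds are made up for illustration and would need tuning on known-good data.

```python
# Heuristic sketch, not a validated detector.
# Assumption: lines that are really Russian text re-mapped onto Georgian
# letters will often have an implausible share of Georgian vowels.

GEORGIAN_VOWELS = set("აეიოუ")


def is_modern_georgian_letter(ch):
    """True for the 33 modern Mkhedruli letters (U+10D0 to U+10F0)."""
    return "\u10d0" <= ch <= "\u10f0"


def vowel_ratio(text):
    """Share of vowels among the Georgian letters in text.

    Returns None when the text contains no Georgian letters at all.
    """
    letters = [ch for ch in text if is_modern_georgian_letter(ch)]
    if not letters:
        return None
    vowels = sum(1 for ch in letters if ch in GEORGIAN_VOWELS)
    return vowels / len(letters)


def looks_suspicious(text, low=0.25, high=0.65, min_letters=10):
    """Flag a line whose vowel share falls outside [low, high].

    The thresholds are arbitrary placeholders (assumption), and short
    lines are never flagged because the ratio is too noisy for them.
    """
    letters = [ch for ch in text if is_modern_georgian_letter(ch)]
    if len(letters) < min_letters:
        return False
    ratio = vowel_ratio(text)
    return ratio < low or ratio > high


if __name__ == "__main__":
    for line in ["გამარჯობა, მეგობარო", "გმრჯბგმრჯბ"]:
        print(line, vowel_ratio(line), looks_suspicious(line))
```

This only gives candidates for manual review. If the mapping happens to preserve a plausible vowel share, the check will miss it, so comparing character n-gram statistics against a trusted Georgian corpus may be a better next step.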