Helsinki-NLP / OPUS

The Open Parallel Corpus

WikiTitles en-ru is ru-en #12

Open eu9ene opened 5 months ago

eu9ene commented 5 months ago

I noticed weird scores while analyzing WikiTitles/v3 for the en-ru language pair. It turned out that the direction of the downloaded dataset is the opposite of the language codes: the .en file contains the Russian titles and the .ru file contains the English ones.

(base) admins-MBP:data epavlov$ head WikiTitles.en-ru.en 
Hijiri 
Литва 
Россия 
Слоновые 
Мамонты 
Красная книга 
Соционика 
Школа 
Лингвистика 
Социология 

(base) admins-MBP:data epavlov$ head WikiTitles.en-ru.ru 
Hijiri 
Lithuania 
Russia 
Elephantidae 
Mammoth 
IUCN Red List 
Socionics 
School 
Linguistics 
Sociology 

https://opus.nlpl.eu/WikiTitles/en&ru/v3/WikiTitles

[Screenshot 2024-04-25 at 2 31 13 PM]

jorgtied commented 5 months ago

Oh, that's bad. Do you know whether many other language pairs are affected in the same way? I need to look into this. Thanks for noting!

eu9ene commented 5 months ago

The UI doesn't show which other language pairs are available, but English to Czech looks correct. I guess something got broken for this dataset:

[Screenshot 2024-04-29 at 10 47 01 AM]
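
To check other pairs for the same problem, one option is a quick script-based check. This is only a rough sketch, not something OPUS provides. The function names `cyrillic_ratio` and `looks_swapped` and the 0.5 threshold are my own assumptions. It only helps for pairs whose languages use different scripts (en-ru yes, en-cs no).

```python
import unicodedata


def cyrillic_ratio(lines):
    """Return the fraction of alphabetic characters that are Cyrillic."""
    total = 0
    cyrillic = 0
    for line in lines:
        for ch in line:
            if ch.isalpha():
                total += 1
                # Unicode character names for Cyrillic letters start with "CYRILLIC".
                if "CYRILLIC" in unicodedata.name(ch, ""):
                    cyrillic += 1
    return cyrillic / total if total else 0.0


def looks_swapped(en_lines, ru_lines, threshold=0.5):
    """Heuristic (assumed threshold): the 'en' side is mostly Cyrillic
    while the 'ru' side is mostly not."""
    return (cyrillic_ratio(en_lines) > threshold
            and cyrillic_ratio(ru_lines) < threshold)


# First lines of the two files as shown in the issue.
en_head = ["Hijiri", "Литва", "Россия", "Слоновые", "Мамонты"]
ru_head = ["Hijiri", "Lithuania", "Russia", "Elephantidae", "Mammoth"]

print(looks_swapped(en_head, ru_head))
```

For real files, the two lists could be filled from the first few hundred lines of WikiTitles.en-ru.en and WikiTitles.en-ru.ru instead of the hard-coded samples.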