Helsinki-NLP / OPUS-ingest


Add Multilingual corpus of Caucasian languages #14

Open jorgtied opened 1 year ago

jorgtied commented 1 year ago

Add multilingual corpus available from https://github.com/danielinux7/Multilingual-Parallel-Corpus

dotsuzu commented 1 year ago

Would it be easier if the data were updated in Tatoeba?

jorgtied commented 1 year ago

I tried to import the data but ran into issues with the TSV files: the column order is not consistent. https://github.com/danielinux7/Multilingual-Parallel-Corpus/blob/master/ab-en/libreoffice.tsv has English in the first column, while https://github.com/danielinux7/Multilingual-Parallel-Corpus/blob/master/ab-en/Ab-En-Syn.tsv has it in the second.

https://github.com/danielinux7/Multilingual-Parallel-Corpus/blob/master/ab-ru/100-text.tsv contains only one language, and for the rest of the ab-ru files I can't tell which column is Russian and which is Abkhazian.

This makes an import quite difficult.
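
One possible workaround, sketched here as a rough heuristic and not as anything OPUS-ingest actually does: guess the language of each column from the characters it contains. English is Latin script, Russian and Abkhazian are both Cyrillic, but Abkhazian uses extra letters (such as ә, ԥ, ҵ) that Russian does not. The function names, the letter list, and the 2% threshold below are all assumptions made for illustration, and the sample rows are made up, not taken from the corpus.

```python
import csv
import io

# Letters used in Abkhazian orthography but not in Russian.
# Assumed, possibly incomplete list; compared against lowercased text.
ABKHAZ_ONLY = set("ӷҕқҟԥҧҭҵҽҿҳџәҩӡ")


def script_counts(text):
    """Count Latin, Cyrillic, and Abkhazian-specific letters in a string."""
    text = text.lower()
    latin = sum(1 for ch in text if "a" <= ch <= "z")
    # U+0400..U+052F covers the Cyrillic and Cyrillic Supplement blocks.
    cyrillic = sum(1 for ch in text if "\u0400" <= ch <= "\u052f")
    abkhaz = sum(1 for ch in text if ch in ABKHAZ_ONLY)
    return latin, cyrillic, abkhaz


def guess_columns(tsv_text, max_rows=1000, abkhaz_ratio=0.02):
    """Return one guess per column: 'en', 'ru', 'ab', or 'unknown'."""
    totals = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        if i >= max_rows:
            break
        for col, cell in enumerate(row):
            counts = totals.setdefault(col, [0, 0, 0])
            for k, n in enumerate(script_counts(cell)):
                counts[k] += n

    guesses = []
    for col in sorted(totals):
        latin, cyrillic, abkhaz = totals[col]
        if latin == 0 and cyrillic == 0:
            guesses.append("unknown")
        elif latin > cyrillic:
            guesses.append("en")
        elif abkhaz / cyrillic > abkhaz_ratio:  # assumed threshold
            guesses.append("ab")
        else:
            guesses.append("ru")
    return guesses


# Made-up sample rows, for illustration only.
print(guess_columns("Abkhaz language\tАԥсуа бызшәа\n"))
print(guess_columns("Аԥсуа бызшәа\tАбхазский язык\n"))
```

A file that yields a single guess, or the same guess for both columns, would still need manual inspection; that seems to be the situation with 100-text.tsv.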