Helsinki-NLP / OPUS

The Open Parallel Corpus

All sentences in `ELRC-{3056-,}wikipedia_health` zh-en end with spaces, possibly duplicates #4

Open jelmervdl opened 1 year ago

jelmervdl commented 1 year ago

I noticed by chance that, in every data format offered for this particular dataset, all lines end with trailing spaces. The original source files, from https://www.elrc-share.eu/repository/browse/covid-19-health-wikipedia-dataset-bilingual-en-zh/c6236d148de811ea913100155d026706c2a9a16f8fc74d0487006e8379d322a0/, don't seem to have this issue.
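A quick way to confirm the trailing-space observation on a downloaded plain-text file could look like the sketch below. It is only an illustration: the function name `trailing_space_stats` is hypothetical and not part of any OPUS tooling, and it assumes one sentence per line.

```python
def trailing_space_stats(lines):
    """Return (total, affected): how many lines were seen and how many
    end in a space or tab once the line terminator is removed."""
    total = 0
    affected = 0
    for line in lines:
        text = line.rstrip("\r\n")  # drop only the line terminator
        total += 1
        if text != text.rstrip(" \t"):  # something other than the newline was trailing
            affected += 1
    return total, affected
```

Passing an open file handle (opened with `encoding="utf-8"`) as `lines` gives the counts for a whole file; if `affected` equals `total`, every line carries trailing whitespace.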

Also, `ELRC-3056-wikipedia_health` and `ELRC-wikipedia_health` might be duplicates. The samples are different, but the en-zh TMX files are exactly the same except for the creation header.

I haven't checked all the other ELRC-imported datasets, but another en-zh dataset didn't seem to have this issue.
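For the duplicate question, one way to test whether two TMX files differ only in their header is to compare just the `<body>` elements. The following is a minimal sketch using Python's standard library; the function name `tmx_bodies_equal` is hypothetical, and it assumes both files fit in memory and use the usual `<tmx><header/><body>` layout.

```python
import xml.etree.ElementTree as ET


def tmx_bodies_equal(tmx_a: str, tmx_b: str) -> bool:
    """Compare two TMX documents (given as strings) while ignoring <header>.

    Returns True only when both documents have a <body> and the
    serialized bodies are byte-for-byte identical.
    """
    body_a = ET.fromstring(tmx_a).find("body")
    body_b = ET.fromstring(tmx_b).find("body")
    if body_a is None or body_b is None:
        return False  # not a TMX layout this sketch understands
    return ET.tostring(body_a) == ET.tostring(body_b)
```

This is a strict comparison: whitespace differences inside the body, such as the trailing spaces mentioned above, also count as differences.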

jorgtied commented 1 year ago

About the duplicate entry: it's kind of intentional, as https://opus.nlpl.eu/ELRC-wikipedia_health-v1.php combines all bitexts of COVID-19-related translations between English and other languages, adding all language pairs pivoted through English. It may not be the cleanest approach to include the English parts in this corpus again but, on the other hand, it nicely creates a multi-parallel corpus with all languages properly linked with each other. I am not sure whether I should change this or not.