albertvillanova commented 2 years ago

uid: multi_un_2
entry: https://arbml.github.io/masader/card.html?158
Link: http://www.euromatrixplus.net/multi-un/
License: unknown
Year: 2010
Language: multilingual
Dialect: ar-MSA: (Arabic (Modern Standard Arabic))
Domain: other
Form: text
Collection Style: human translation
Description: 6 official languages of the UN, consisting of around 300 million words per language
Volume: 65,156
Unit: documents
Ethical Risks: Low
Provider: DFKI
Derived From:
Paper Title: MultiUN: A Multilingual Corpus from United Nation Documents
Paper Link: https://www.dfki.de/fileadmin/user_upload/import/4790_686_Paper.pdf
Script: Arab
Tokenized: No
Host: other
Access: Free
Cost:
Test Split: Yes
Tasks: machine translation
Evaluation Set?:
Venue Title: LREC
Citations: 223
Venue Type: conference
Venue Name: International Conference on Language Resources and Evaluation
authors: A. Eisele
affiliations:
abstract: This paper describes the acquisition, preparation and properties of a corpus extracted from the official documents of the United Nations (UN). This corpus is available in all 6 official languages of the UN, consisting of around 300 million words per language. We describe the methods we used for crawling, document formatting, and sentence alignment. This corpus also includes a common test set for machine translation. We present the results of a French-Chinese machine translation experiment performed on this corpus.
Added by: Zaid
Notes:

lingjzhu commented 2 years ago

self-assign

albertvillanova commented 2 years ago

Question for @mariosasko: are you still working on this?

mariosasko commented 2 years ago

Yes, I'll finish this one today.

mariosasko commented 2 years ago

Done! LM repos:

Note: In the scripts, I pull the data directly from the single-language archives at http://www.euromatrixplus.net/multi-un/ rather than from the translation files, as is done in https://huggingface.co/datasets/un_multi.
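
For illustration only, reading one single-language archive could look roughly like the sketch below. It assumes the archive is a tar file (optionally compressed) that has already been downloaded and opened as a binary file object; the archive layout and the helper name `iter_archive_documents` are assumptions, not taken from the actual scripts.

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_archive_documents(fileobj: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (member_name, text) for every regular file in a tar archive.

    Hypothetical helper: assumes the single-language archive is a tar file
    and that its members are UTF-8 encoded text documents.
    """
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            # Skip directories, links, and other non-file entries.
            if not member.isfile():
                continue
            extracted = archive.extractfile(member)
            if extracted is None:
                continue
            # Replace undecodable bytes instead of failing on a bad document.
            yield member.name, extracted.read().decode("utf-8", errors="replace")
```

Working per language this way keeps each document in the output once, whereas the translation files are organised by language pair.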

albertvillanova commented 2 years ago

Thanks @mariosasko, yes, well done! ;)

However, I think the data includes more languages than the target ones. Should we remove the extra ones? CC: @yjernite
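
If we do remove them, a minimal sketch of the filtering step might look like this. The language codes in `TARGET_LANGUAGES` and the function name are placeholders, since the target list is not spelled out in this thread.

```python
from typing import Dict, FrozenSet, List

# Placeholder target set: the thread does not list the target languages,
# so these codes are assumptions for illustration only.
TARGET_LANGUAGES: FrozenSet[str] = frozenset({"ar", "en", "es", "fr", "zh"})


def keep_target_languages(
    docs_by_language: Dict[str, List[str]],
    targets: FrozenSet[str] = TARGET_LANGUAGES,
) -> Dict[str, List[str]]:
    """Return a new mapping that keeps only the languages listed in targets."""
    return {
        lang: docs
        for lang, docs in docs_by_language.items()
        if lang in targets
    }


# Usage with toy data: "de" is dropped because it is not in the target set.
corpus = {"ar": ["doc-1"], "de": ["doc-2"], "en": ["doc-3"]}
filtered = keep_target_languages(corpus)
```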
