bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0
77 stars 49 forks

Create dataset United Nations Parallel Corpus #277

Closed albertvillanova closed 2 years ago

albertvillanova commented 2 years ago

Source: Masader Project

lingjzhu commented 2 years ago

self-assign

lingjzhu commented 2 years ago

Oops, it seems that this data is already available at Huggingface datasets: https://huggingface.co/datasets/un_multi

No, that's another corpus. I will work on this now.

lingjzhu commented 2 years ago

help

I have a question about this dataset. It comes with parallel texts aligned across multiple languages, so it probably should not be uploaded as one huge plain text file (as in generation tasks), because the translation pairs would be lost. Is there any specific format I should use for this dataset? Thanks!
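One way to keep the pairs intact is to store one aligned record per line instead of flat monolingual text. A minimal sketch, assuming a translation-pair layout similar to the `translation` feature used by several Hugging Face translation datasets (the sample sentences here are made up for illustration):

```python
import json

# Hypothetical sentence-aligned parallel data: one record per aligned pair,
# so the translation links survive serialization.
aligned = [
    {"en": "Official Records", "fr": "Documents officiels"},
    {"en": "General Assembly", "fr": "Assemblée générale"},
]

# Serialize as JSON Lines: one aligned record per line.
lines = [json.dumps({"translation": pair}, ensure_ascii=False) for pair in aligned]
for line in lines:
    print(line)
```

Each line is then a self-contained pair, which also streams well with `load_dataset(..., streaming=True)`.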

lingjzhu commented 2 years ago

This dataset is now available at: https://huggingface.co/datasets/bigscience-catalogue-data/uncorpus

albertvillanova commented 2 years ago

Hi @lingjzhu,

Thanks for adding this dataset.

I just saw that you tried to label this issue asking for help but it didn't work. Next time you need help, please remember that you have to make a comment containing ONLY the keyword #help (nothing else), so that it is automatically labeled with "help wanted".

Regarding the data files, could you please give more details about the files you uploaded? On the Download page of this dataset, several different data files appear:

lingjzhu commented 2 years ago

Sorry! I have uploaded the XML files. Each language has a separate XML file containing only monolingual text. The alignments are stored separately in a links.zip file, which contains sentence alignments between the six languages in XML format.

Please let me know if there is any preprocessing that I need to do. Thank you.
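Joining the monolingual files with the alignment file could be sketched roughly as follows. This assumes an OPUS-style layout (sentences tagged `<s id="...">` per language, and a links file whose `<link xtargets="src;tgt">` entries pair sentence ids); the tiny inline excerpts are invented for illustration, not taken from the actual corpus:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal excerpts: one monolingual XML file per language,
# plus a separate alignment file pairing sentence ids across languages.
en_xml = '<doc><s id="1">Official Records</s><s id="2">General Assembly</s></doc>'
fr_xml = '<doc><s id="1">Documents officiels</s><s id="2">Assemblée générale</s></doc>'
links_xml = '<linkGrp><link xtargets="1;1"/><link xtargets="2;2"/></linkGrp>'

def sentences(xml_text):
    """Map sentence id -> sentence text for one monolingual file."""
    return {s.get("id"): "".join(s.itertext()).strip()
            for s in ET.fromstring(xml_text).iter("s")}

en, fr = sentences(en_xml), sentences(fr_xml)

# Resolve each link into an aligned pair; an xtargets side may list
# several ids (many-to-one alignments), so join them with spaces.
pairs = []
for link in ET.fromstring(links_xml).iter("link"):
    src_ids, tgt_ids = link.get("xtargets").split(";")
    pairs.append({
        "en": " ".join(en[i] for i in src_ids.split()),
        "fr": " ".join(fr[i] for i in tgt_ids.split()),
    })
```

The same loop generalizes to the other language pairs by swapping in the corresponding monolingual and link files.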

albertvillanova commented 2 years ago

Hi @lingjzhu,

Yes, XML files will require further preprocessing. The final target is that the dataset should be loadable as:

from datasets import load_dataset
ds = load_dataset("bigscience-catalogue-data/uncorpus", split="train", streaming=True, use_auth_token=True)
item = next(iter(ds))
albertvillanova commented 2 years ago

DONE:

Sample:


{
  'text': "Journal\nof the United Nations\nUnited Nations Forum on Forests\nTenth session (8 - 19 April 2013)\nIstanbul, Turkey\nProgramme of meetings\nFriday, 19 April 2013\nEconomic and Social Council\nTenth session\n09:00 to 13:00 14th meeting [webcast] Lütfi Kirdar Convention and Exhibition Centre\nAnadolu Hall\n1. Assessment of progress made on the implementation of the non-legally binding instrument on all types of forests and towards the achievement of the four global objectives on forests [item 3]\n2. Regional and subregional inputs [item 4]\n3. Forests and economic development [item 5]\n(a) Forest products and services...",
  'meta': "{'file': 'en/2013/istanbul_journal_no__10___unff10.xml'}"
}
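Note that in the sample above the `meta` field is a string-encoded Python dict rather than a nested object. Assuming it always follows that shape, it can be recovered with `ast.literal_eval` (the abbreviated `text` value here is just a placeholder):

```python
import ast

# Item shaped like the sample record above; 'meta' is a stringified dict.
item = {
    "text": "Journal\nof the United Nations\n...",
    "meta": "{'file': 'en/2013/istanbul_journal_no__10___unff10.xml'}",
}

# Safely parse the string back into a dict (literal_eval only accepts
# Python literals, unlike eval).
meta = ast.literal_eval(item["meta"])
```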