bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0
77 stars 49 forks

Create dataset United Nations Parallel Corpus #277

Closed albertvillanova closed 2 years ago

albertvillanova commented 2 years ago

Source: Masader Project

lingjzhu commented 2 years ago

self-assign

lingjzhu commented 2 years ago

Oops, it seems that this data is already available at Huggingface datasets: https://huggingface.co/datasets/un_multi

No, that's another corpus. I will work on this now.

lingjzhu commented 2 years ago

help

I have a question about this dataset. It comes with parallel texts aligned across multiple languages, so it probably should not be uploaded as one huge plain text file (as in generation tasks), because the translation pairs would be lost. Is there any specific format I should use for this dataset? Thanks!
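One way to keep the pairs intact is to store one aligned record per line instead of flat monolingual text. A minimal sketch, assuming a translation-pair layout similar to the `translation` feature used by several Hugging Face translation datasets (the sample sentences here are made up for illustration):

```python
import json

# Hypothetical sentence-aligned parallel data: one record per aligned pair,
# so the translation links survive serialization.
aligned = [
    {"en": "Official Records", "fr": "Documents officiels"},
    {"en": "General Assembly", "fr": "Assemblée générale"},
]

# Serialize as JSON Lines: one aligned record per line.
lines = [json.dumps({"translation": pair}, ensure_ascii=False) for pair in aligned]
for line in lines:
    print(line)
```

Each line is then a self-contained pair, which also streams well with `load_dataset(..., streaming=True)`.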

lingjzhu commented 2 years ago

This dataset is now available at: https://huggingface.co/datasets/bigscience-catalogue-data/uncorpus

albertvillanova commented 2 years ago

Hi @lingjzhu,

Thanks for adding this dataset.

I just saw that you tried to label this issue asking for help but it didn't work. Next time you need help, please remember that you have to make a comment containing ONLY the keyword #help (nothing else), so that it is automatically labeled with "help wanted".

Regarding the data files, could you please give more details about the files you uploaded? On the Download page of this dataset, several different data files appear:

lingjzhu commented 2 years ago

Sorry! I have uploaded the XML files. Each language has a separate XML file containing only monolingual text. The alignments are stored separately in a links.zip file, which contains sentence alignments between the six languages in XML format.

Please let me know if there is any preprocessing that I need to do. Thank you.
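Joining the monolingual files with the alignment file could be sketched roughly as follows. This assumes an OPUS-style layout (sentences tagged `<s id="...">` per language, and a links file whose `<link xtargets="src;tgt">` entries pair sentence ids); the tiny inline excerpts are invented for illustration, not taken from the actual corpus:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal excerpts: one monolingual XML file per language,
# plus a separate alignment file pairing sentence ids across languages.
en_xml = '<doc><s id="1">Official Records</s><s id="2">General Assembly</s></doc>'
fr_xml = '<doc><s id="1">Documents officiels</s><s id="2">Assemblée générale</s></doc>'
links_xml = '<linkGrp><link xtargets="1;1"/><link xtargets="2;2"/></linkGrp>'

def sentences(xml_text):
    """Map sentence id -> sentence text for one monolingual file."""
    return {s.get("id"): "".join(s.itertext()).strip()
            for s in ET.fromstring(xml_text).iter("s")}

en, fr = sentences(en_xml), sentences(fr_xml)

# Resolve each link into an aligned pair; an xtargets side may list
# several ids (many-to-one alignments), so join them with spaces.
pairs = []
for link in ET.fromstring(links_xml).iter("link"):
    src_ids, tgt_ids = link.get("xtargets").split(";")
    pairs.append({
        "en": " ".join(en[i] for i in src_ids.split()),
        "fr": " ".join(fr[i] for i in tgt_ids.split()),
    })
```

The same loop generalizes to the other language pairs by swapping in the corresponding monolingual and link files.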

albertvillanova commented 2 years ago

Hi @lingjzhu,

Yes, XML files will require further preprocessing. The final target is that the dataset should be loadable as:

from datasets import load_dataset
ds = load_dataset("bigscience-catalogue-data/uncorpus", split="train", streaming=True, use_auth_token=True)
item = next(iter(ds))
albertvillanova commented 2 years ago

DONE:

Sample:


{
  'text': "Journal\nof the United Nations\nUnited Nations Forum on Forests\nTenth session (8 - 19 April 2013)\nIstanbul, Turkey\nProgramme of meetings\nFriday, 19 April 2013\nEconomic and Social Council\nTenth session\n09:00 to 13:00 14th meeting [webcast] Lütfi Kirdar Convention and Exhibition Centre\nAnadolu Hall\n1. Assessment of progress made on the implementation of the non-legally binding instrument on all types of forests and towards the achievement of the four global objectives on forests [item 3]\n2. Regional and subregional inputs [item 4]\n3. Forests and economic development [item 5]\n(a) Forest products and services...",
  'meta': "{'file': 'en/2013/istanbul_journal_no__10___unff10.xml'}"
}
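Note that in the sample above the `meta` field is a string-encoded Python dict rather than a nested object. Assuming it always follows that shape, it can be recovered with `ast.literal_eval` (the abbreviated `text` value here is just a placeholder):

```python
import ast

# Item shaped like the sample record above; 'meta' is a stringified dict.
item = {
    "text": "Journal\nof the United Nations\n...",
    "meta": "{'file': 'en/2013/istanbul_journal_no__10___unff10.xml'}",
}

# Safely parse the string back into a dict (literal_eval only accepts
# Python literals, unlike eval).
meta = ast.literal_eval(item["meta"])
```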