Closed by albertvillanova 2 years ago
Oops, it seems that this data is already available at Huggingface datasets: https://huggingface.co/datasets/un_multi
No, that's another corpus. I will work on this now.
I have a question about this dataset. It comes with parallel texts aligned across multiple languages, so it probably should not be uploaded as one huge plain text file, as is done for generation tasks, because the translation pairs would be lost. Is there a specific format I should use for this dataset? Thanks!
This dataset is now available at: https://huggingface.co/datasets/bigscience-catalogue-data/uncorpus
Hi @lingjzhu,
Thanks for adding this dataset.
I just saw that you tried to label this issue to ask for help, but it didn't work. Next time you need help, please remember to post a comment containing ONLY the keyword #help (nothing else), so that the issue is automatically labeled with "help wanted".
Regarding the data files, could you please give more details about the files you uploaded? On the Download page of this dataset, several different data files appear:
Sorry! I have uploaded the XML files. Each language has a separate XML file containing only monolingual text. The alignments are stored separately in a links.zip file, which contains the alignments between the six languages in XML format.
Please let me know if there is any preprocessing that I need to do. Thank you.
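For illustration, here is a minimal sketch of how monolingual XML files could be joined through a separate alignment file so that translation pairs stay together. Everything about the schema is an assumption: the tag and attribute names (`s`, `id`, `link`, `xtargets`), the `id;id` link format, and the inline sample strings are hypothetical stand-ins, not the actual layout of the uploaded files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real files may use different tags and attributes.
EN_XML = "<doc><s id='1'>Programme of meetings</s><s id='2'>Tenth session</s></doc>"
FR_XML = "<doc><s id='1'>Programme des séances</s><s id='2'>Dixième session</s></doc>"
LINKS_XML = "<links><link xtargets='1;1'/><link xtargets='2;2'/></links>"


def sentences_by_id(xml_text):
    """Map each sentence id to its text for one monolingual document."""
    root = ET.fromstring(xml_text)
    return {s.get("id"): (s.text or "").strip() for s in root.iter("s")}


def aligned_pairs(src_xml, tgt_xml, links_xml):
    """Yield one record per alignment link, keeping both sides together."""
    src = sentences_by_id(src_xml)
    tgt = sentences_by_id(tgt_xml)
    for link in ET.fromstring(links_xml).iter("link"):
        # Assumed format: "<source ids>;<target ids>", ids separated by spaces.
        src_ids, tgt_ids = link.get("xtargets").split(";")
        yield {
            "source": " ".join(src[i] for i in src_ids.split()),
            "target": " ".join(tgt[i] for i in tgt_ids.split()),
        }


pairs = list(aligned_pairs(EN_XML, FR_XML, LINKS_XML))
```

Emitting one record per link, rather than one per sentence, would keep many-to-one alignments intact if the real link files contain them.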
Hi @lingjzhu,
Yes, the XML files will require further preprocessing. The final goal is for the dataset to be loadable as:
from datasets import load_dataset
ds = load_dataset("bigscience-catalogue-data/uncorpus", split="train", streaming=True, use_auth_token=True)
item = next(iter(ds))
DONE:
Sample:
{
'text': "Journal/nof the United Nations/nUnited Nations Forum on Forests/nTenth session (8 - 19 April 2013)/nIstanbul, Turkey/nProgramme of meetings/nFriday, 19 April 2013/nEconomic and Social Council/nTenth session/n09:00 to 13:00 14th meeting [webcast] Lütfi Kirdar Convention and Exhibition Centre/nAnadolu Hall/n1. Assessment of progress made on the implementation of the non-legally binding instrument on all types of forests and towards the achievement of the four global objectives on forests [item 3]/n2. Regional and subregional inputs [item 4]/n3. Forests and economic development [item 5]/n(a) Forest products and services...",
'meta': "{'file': 'en/2013/istanbul_journal_no__10___unff10.xml'}"
}
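Note that in the sample above the 'meta' value is a string holding a dict literal, not a dict. A minimal sketch of how a consumer might read it follows; the record is a shortened copy of the sample, and treating the first two path components as language code and year is an assumption based on this one file path.

```python
import ast

# Record shaped like the sample above; the text is shortened here for brevity.
item = {
    "text": "Journal/nof the United Nations",
    "meta": "{'file': 'en/2013/istanbul_journal_no__10___unff10.xml'}",
}

# 'meta' is a string containing a Python dict literal, so parse it with
# ast.literal_eval, which only accepts literals and never executes code.
meta = ast.literal_eval(item["meta"])

# Assumption: the path is laid out as <language>/<year>/<file name>.
language, year, filename = meta["file"].split("/", 2)
```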
Source: Masader Project