bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0
77 stars, 48 forks

Create dataset MultiUN v2 #288

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago

Source: Masader Project

lingjzhu commented 2 years ago

self-assign

lingjzhu commented 2 years ago

This dataset is already available here: https://huggingface.co/datasets/un_multi

mariosasko commented 2 years ago

self-assign

albertvillanova commented 2 years ago

Question for @mariosasko: are you still working on this?

mariosasko commented 2 years ago

Yes, I'll finish this one today.

mariosasko commented 2 years ago

Done! LM repos:

Note: In the scripts, I pull the data directly from the single-language archives at http://www.euromatrixplus.net/multi-un/ rather than from the translation files, as is done in https://huggingface.co/datasets/un_multi.

albertvillanova commented 2 years ago

Thanks @mariosasko, yes, well done! ;)

On the other hand, I think there are more languages than the target ones. Should we remove the extra ones? CC: @yjernite
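
For illustration only, removing the extra languages could be as simple as filtering the available language codes against the target set. This is a minimal sketch, not the code from the scripts: the two language lists and the `split_languages` helper are hypothetical, and the actual target languages would need to be confirmed.

```python
# Hypothetical language sets, for illustration only.
# AVAILABLE_LANGUAGES: assumed codes of the single-language archives.
# TARGET_LANGUAGES: assumed target set; confirm before using.
AVAILABLE_LANGUAGES = ["ar", "de", "en", "es", "fr", "ru", "zh"]
TARGET_LANGUAGES = {"ar", "en", "es", "fr", "zh"}


def split_languages(available, targets):
    """Split `available` into (kept, extra), preserving the input order."""
    kept = [lang for lang in available if lang in targets]
    extra = [lang for lang in available if lang not in targets]
    return kept, extra


kept, extra = split_languages(AVAILABLE_LANGUAGES, TARGET_LANGUAGES)
print("keep:", kept)
print("drop:", extra)
```

Reporting the dropped codes alongside the kept ones makes it easy to check that nothing is removed by accident.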