Status: Open. albertvillanova opened this issue 2 years ago.
This dataset is already available here: https://huggingface.co/datasets/un_multi
Question for @mariosasko: are you still working on this?
Yes, I'll finish this one today.
Done! LM repos:
ar: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ar_multi_un_2
de: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_de_multi_un_2
en: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_multi_un_2
es: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_es_multi_un_2
fr: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_fr_multi_un_2
ru: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ru_multi_un_2
zh: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh_multi_un_2
Note: In the scripts, I pull the data directly from the single-language archives from http://www.euromatrixplus.net/multi-un/ and not from the translation files as it's done in https://huggingface.co/datasets/un_multi.
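For anyone scripting against these repos: the seven URLs above share one naming pattern, so the repo ids can be generated rather than copied by hand. The snippet below is only a small sketch of that. The helper names `lm_repo_id` and `lm_repo_url` are made up for illustration and are not part of the processing scripts; the organization name and the `lm_<lang>_multi_un_2` pattern are read off the URLs in this comment.

```python
# Language codes as they appear in the repo list in this comment.
LANGS = ["ar", "de", "en", "es", "fr", "ru", "zh"]

# Organization name, copied from the URLs in this comment.
ORG = "bigscience-catalogue-lm-data"


def lm_repo_id(lang: str) -> str:
    """Return the Hub repo id for one language code (hypothetical helper)."""
    if lang not in LANGS:
        raise ValueError(f"no LM repo listed for language {lang!r}")
    return f"{ORG}/lm_{lang}_multi_un_2"


def lm_repo_url(lang: str) -> str:
    """Return the dataset page URL for one language code (hypothetical helper)."""
    return f"https://huggingface.co/datasets/{lm_repo_id(lang)}"


if __name__ == "__main__":
    # Print one "lang: url" line per listed repo.
    for lang in LANGS:
        print(f"{lang}: {lm_repo_url(lang)}")
```

Assuming the `datasets` library is installed and you have access to the repos, something like `load_dataset(lm_repo_id("ar"))` should then fetch one of them, though the available splits and any access restrictions are not stated in this thread.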
Thanks @mariosasko, yes, well done! ;)
On the other hand, I think there are more languages than the target ones. Should we remove the extra ones? CC: @yjernite
Source: Masader Project