bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset unsupervised_cross_lingual_representation_learning_at_scale #253

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago

Already available: https://huggingface.co/datasets/cc100

Sample:


{'id': '0',
 'text': 'वैशाख २१ – आर्सनललाई हराउँदै एथ्लेटिको मड्रिड युरोपा लिगको फाइनलमा प्रवेश गरेको छ ।\n'}

mariosasko commented 2 years ago

self-assign

mariosasko commented 2 years ago

Done! LM repo: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_indic-ne_unsupervised_cross_lingual_representation_learning_at_scale

albertvillanova commented 2 years ago

Thanks @mariosasko

@yjernite is Nepali among the target languages? I can't find it in the list...