bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset BRAD 2.0 #283

Closed albertvillanova closed 2 years ago

albertvillanova commented 2 years ago

Source: Masader Project

KhalidAlt commented 2 years ago

self-assign

albertvillanova commented 2 years ago

Thanks @KhalidAlt !

Link: https://huggingface.co/datasets/bigscience-catalogue-data/brad_2

Note that we are interested in the unbalanced subset, which contains the whole dataset.

albertvillanova commented 2 years ago

DONE: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ar_brad_2

Sample:


{
  'text': 'صراع الجذور والانتماء، عقلة ساق الخيزان توائم نفسها وتنمو ايا كانت التربة. فكك الكاتب المجتمع الفلبيني والكويتي،غاص عميقا عميقا في تعقيداتهما معا،، رواية ممتعة.',
  'meta': "{'id': '1682581870'}"
}
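
In the sample above, `meta` is stored as a string holding a Python-style dict rather than as a nested object. A minimal sketch of how a consumer might decode it, assuming every record follows the shape shown in the sample; the helper name `parse_meta` is hypothetical and not part of the dataset or of any library, and the `text` value is replaced by a placeholder here:

```python
import ast


def parse_meta(record):
    """Return a copy of `record` with its 'meta' string decoded into a dict.

    Assumes 'meta' holds a Python literal such as "{'id': '1682581870'}".
    ast.literal_eval only evaluates literals, so it does not execute code.
    """
    decoded = dict(record)
    decoded["meta"] = ast.literal_eval(record["meta"])
    return decoded


# Record shaped like the sample above; the text is a placeholder.
sample = {"text": "...", "meta": "{'id': '1682581870'}"}

parsed = parse_meta(sample)
print(parsed["meta"]["id"])
```

Single quotes make the `meta` string invalid JSON, which is why this sketch uses `ast.literal_eval` instead of `json.loads`.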