bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset KALIMAT #293

Closed by albertvillanova 2 years ago

albertvillanova commented 2 years ago

Source: Masader Project

albertvillanova commented 2 years ago

DONE: https://huggingface.co/datasets/bigscience-catalogue-data/kalimat

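from datasets import load_dataset

# streaming=True iterates over records lazily; use_auth_token=True sends the locally stored Hugging Face token
ds = load_dataset("bigscience-catalogue-data/kalimat", split="train", streaming=True, use_auth_token=True)
item = next(iter(ds))  # fetch only the first record
item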

output:

{'text': 'كتب'}
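
To look at more than the first record without iterating over the whole stream, something like the sketch below might help. The `peek` helper and the `sample` list are hypothetical and not part of the `datasets` library; `sample` only stands in for the streamed dataset so the snippet runs offline, and its single record is the one shown in the output above.

```python
from itertools import islice


def peek(records, n=3, field="text"):
    """Return the `field` value of the first `n` records of any iterable of dicts."""
    # islice stops after n items, so a streamed dataset is never fully consumed
    return [record[field] for record in islice(records, n)]


# Hypothetical stand-in for the streamed dataset (assumption: records are dicts with a 'text' key)
sample = [{"text": "كتب"}]

first = peek(iter(sample), n=1)
print(first)
```

With the real dataset, `peek(ds, n=5)` should work the same way, assuming `ds` is the streaming dataset loaded above and every record has a `text` field.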

mariosasko commented 2 years ago

self-assign

mariosasko commented 2 years ago

Done! LM repo: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ar_kalimat

albertvillanova commented 2 years ago

Thanks @mariosasko