SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for mc4-indo #61

Closed SamuelCahyawijaya closed 10 months ago

SamuelCahyawijaya commented 10 months ago

Dataloader name: mc4_indo/mc4_indo.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?mc4_indo

Dataset: mc4_indo
Description: A thoroughly cleaned version of the Indonesian split of mC4, the multilingual colossal, cleaned version of Common Crawl's web crawl corpus. This portion contains the Indonesian-language content extracted and processed from the larger mC4 dataset. The extraction and cleaning were conducted by AllenAI. For more information about the original mC4 dataset and its preparation, see https://huggingface.co/datasets/allenai/c4.
Subsets: -
Languages: ind
Tasks: Language Modeling
License: Open Data Commons License Attribution family (odc-by)
Homepage: https://huggingface.co/datasets/indonesian-nlp/mc4-id
HF URL: https://huggingface.co/datasets/indonesian-nlp/mc4-id
Paper URL: https://aclanthology.org/2021.naacl-main.41
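Since the task is Language Modeling, the core of the loader is likely a mapping from raw records to a plain id/text example. The sketch below is illustrative only. It assumes each raw record carries `text`, `timestamp`, and `url` fields (as in the allenai/c4 mC4 layout) and that the target schema has just `id` and `text`. The function name `to_text_examples` and the sample records are hypothetical and not part of the SEACrowd codebase.

```python
from typing import Dict, Iterable, Iterator, Tuple


def to_text_examples(
    records: Iterable[Dict[str, str]],
) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs with only `id` and `text` fields.

    Assumption: each input record has a `text` field; other fields
    (here `timestamp` and `url`) are dropped from the output.
    """
    for idx, record in enumerate(records):
        yield idx, {"id": str(idx), "text": record["text"]}


# Hypothetical records standing in for rows of the Indonesian mC4 split.
sample_records = [
    {
        "text": "Selamat pagi.",
        "timestamp": "2020-01-01T00:00:00Z",
        "url": "https://example.com/a",
    },
    {
        "text": "Terima kasih banyak.",
        "timestamp": "2020-01-02T00:00:00Z",
        "url": "https://example.com/b",
    },
]

examples = list(to_text_examples(sample_records))
```

A real loader would read records from the HF URL listed above instead of an in-memory list; the generator shape keeps the mapping independent of how the records are fetched.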
williamnixon20 commented 10 months ago

self-assign