A thoroughly cleaned version of the Indonesian split of mC4, the multilingual, colossal, cleaned version of Common Crawl's web crawl corpus. It contains the Indonesian-language content extracted from the larger mC4 dataset, with extraction and cleaning carried out by AllenAI. For details on the original mC4 dataset and its preparation, see https://huggingface.co/datasets/allenai/c4.
Subsets
-
Languages
ind
Tasks
Language Modeling
License
Open Data Commons Attribution License (odc-by)
Dataloader name:
mc4_indo/mc4_indo.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?mc4_indo