Create dataset loader for mc4-indo

Dataset	mc4_indo
Description	A thoroughly cleaned version of the Indonesian split of mC4, the multilingual Colossal Clean Crawled Corpus built from Common Crawl's web crawl. It contains the Indonesian-language content extracted from the larger mC4 dataset. The extraction and cleaning were conducted by AllenAI. For more information about the original mC4 dataset and its preparation, see https://huggingface.co/datasets/allenai/c4.
Subsets	-
Languages	ind
Tasks	Language Modeling
License	Open Data Commons License Attribution family (odc-by)
Homepage	https://huggingface.co/datasets/indonesian-nlp/mc4-id
HF URL	https://huggingface.co/datasets/indonesian-nlp/mc4-id
Paper URL	https://aclanthology.org/2021.naacl-main.41
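
As a starting point for the loader, the record-to-example mapping for the Language Modeling task could be sketched as below. This is a minimal sketch, not the final implementation. It assumes each raw record carries the `text`, `timestamp`, and `url` fields used by mC4, and that the target is a self-supervised pretraining schema with only `id` and `text`. The function names `to_ssp_example` and `generate_examples` are hypothetical, and the sample record is made up for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


def to_ssp_example(idx: int, record: Dict[str, str]) -> Dict[str, str]:
    """Map one raw record to an assumed {"id", "text"} pretraining schema.

    Assumption: the raw record has a "text" field, as in mC4. The
    "timestamp" and "url" fields are dropped in this sketch.
    """
    return {"id": str(idx), "text": record["text"]}


def generate_examples(
    records: Iterable[Dict[str, str]],
) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs, the shape a loader's generator usually returns."""
    for idx, record in enumerate(records):
        yield idx, to_ssp_example(idx, record)


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    sample = [
        {
            "text": "Halo dunia.",
            "timestamp": "2020-01-01T00:00:00Z",
            "url": "https://example.com/halo",
        }
    ]
    for key, example in generate_examples(sample):
        print(key, example)
```

Keeping the mapping in a pure function makes it testable without downloading the corpus. The actual loader would feed it records read from the HF URL listed above.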

SEACrowd / seacrowd-datahub