Create dataset loader for Glot500-c

Dataset	glot500_c
Description	Glot500-c is a text corpus covering 511 languages, on which the Glot500-m language model was trained. It is a subset of Glot2000-c that keeps only languages meeting a minimum of 30,000 sentences. The corpus is about 600 GB in size and contains about 1.5 billion sentences. The data was obtained partly by crawling websites and partly by compiling existing datasets, so it may overlap with other SEACrowd datasets. Licenses may also differ across the underlying datasets, and some of those datasets require specific registration and/or access requests. The work also includes several benchmarks and evaluation tasks on which Glot500-m is evaluated.
Subsets	-
Languages	bsb, iba, ind, eng, zsm, khm, lao, tha, tdt, por, ace, fil, vie, tih, mya, bcl, ceb, zlm, jav, sun, bjn, min, tgl, tam, hil, ilo, kac, war, ahk, dtp, ksw, lhu, pag, cmn, pam, bbc, ban, sxn, nia, btx, gor, mad, bts, mbb, prk, ibg, bhw, ifb, ifa, mrw
Tasks	Language Modeling, Named Entity Recognition, POS Tagging, Text Classification
License	Other (other)
Homepage	https://github.com/cisnlp/Glot500
HF URL	https://huggingface.co/datasets/cis-lmu/Glot500
Paper URL	https://aclanthology.org/2023.acl-long.61/
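
For whoever picks this up, the minimum-sentence criterion mentioned in the description can be sketched as a simple filter. This is only an illustration: the function name `select_languages`, the inclusive `>=` boundary, and the sentence counts are assumptions made up for the example, not taken from the Glot500 code or statistics.

```python
from typing import Dict, List

# Threshold quoted in the dataset description.
MIN_SENTENCES = 30_000


def select_languages(
    sentence_counts: Dict[str, int],
    min_sentences: int = MIN_SENTENCES,
) -> List[str]:
    """Return the language codes whose sentence count meets the minimum.

    Assumption: the boundary is inclusive (count >= min_sentences).
    """
    return sorted(
        lang for lang, count in sentence_counts.items() if count >= min_sentences
    )


# Hypothetical counts for illustration only; not real Glot500-c statistics.
example_counts = {"ind": 1_200_000, "iba": 45_000, "xxx": 12_000}
selected = select_languages(example_counts)
```

A dataloader could apply a similar check when deciding which per-language subsets to expose, although the published Glot500-c release should already be filtered this way.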
