SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Glot500-c #682

Open SamuelCahyawijaya opened 1 month ago

SamuelCahyawijaya commented 1 month ago

Dataloader name: glot500_c/glot500_c.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?glot500_c

Dataset glot500_c
Description Glot500-c is a text corpus covering 511 languages, on which the Glot500-m language model was trained. It is a subset of Glot2000-c: languages with fewer than 30,000 sentences are excluded. The corpus is about 600 GB in size and contains about 1.5 billion sentences. The data was obtained partly by crawling websites and partly by compiling existing datasets, so it may overlap with other SEACrowd datasets. Licenses may also differ across the underlying datasets, and some of those datasets require specific registration and/or access requests. The work also includes several benchmarks and evaluation tasks on which Glot500-m is evaluated.
Subsets -
Languages bsb, iba, ind, eng, zsm, khm, lao, tha, tdt, por, ace, fil, vie, tih, mya, bcl, ceb, zlm, jav, sun, bjn, min, tgl, tam, hil, ilo, kac, war, ahk, dtp, ksw, lhu, pag, cmn, pam, bbc, ban, sxn, nia, btx, gor, mad, bts, mbb, prk, ibg, bhw, ifb, ifa, mrw
Tasks Language Modeling, Named Entity Recognition, POS Tagging, Text Classification
License Other (other)
Homepage https://github.com/cisnlp/Glot500
HF URL https://huggingface.co/datasets/cis-lmu/Glot500
Paper URL https://aclanthology.org/2023.acl-long.61/
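
For whoever picks this up, here is a minimal sketch of two checks that may help when writing the loader: parsing the Languages field above into a validated list of codes, and applying the 30,000-sentence minimum from the description. The helper names (`parse_language_codes`, `select_languages`) and the sentence counts are hypothetical and only for illustration; treating the threshold as inclusive (`>=`) is also an assumption, and none of this is the Glot500 authors' code.

```python
import re

# Threshold quoted in the dataset description (assumed inclusive here).
MIN_SENTENCES = 30_000

# Language codes copied from the "Languages" field of this issue.
RAW_LANGUAGES = (
    "bsb, iba, ind, eng, zsm, khm, lao, tha, tdt, por, "
    "ace, fil, vie, tih, mya, bcl, ceb, zlm, jav, sun, "
    "bjn, min, tgl, tam, hil, ilo, kac, war, ahk, dtp, "
    "ksw, lhu, pag, cmn, pam, bbc, ban, sxn, nia, btx, "
    "gor, mad, bts, mbb, prk, ibg, bhw, ifb, ifa, mrw"
)


def parse_language_codes(raw):
    """Split a comma-separated string into three-letter codes, rejecting bad or repeated entries."""
    codes = [part.strip() for part in raw.split(",") if part.strip()]
    malformed = [c for c in codes if not re.fullmatch(r"[a-z]{3}", c)]
    if malformed:
        raise ValueError(f"not three lowercase letters: {malformed}")
    if len(set(codes)) != len(codes):
        raise ValueError("duplicate language codes found")
    return codes


def select_languages(sentence_counts, wanted, min_sentences=MIN_SENTENCES):
    """Return the wanted codes whose sentence count meets the minimum, sorted alphabetically."""
    wanted_set = set(wanted)
    return sorted(
        code
        for code, count in sentence_counts.items()
        if code in wanted_set and count >= min_sentences
    )


codes = parse_language_codes(RAW_LANGUAGES)

# Hypothetical counts, NOT real Glot500-c statistics.
hypothetical_counts = {"ind": 1_000_000, "jav": 29_999, "deu": 5_000_000}

print(len(codes))
print(select_languages(hypothetical_counts, codes))
```

In the made-up example, `jav` is dropped because it falls one sentence short of the minimum, and `deu` is dropped because it is not in the SEA language list.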