Glot500-c is a text corpus covering 511 languages, on which the Glot500-m LLM was trained. It is a subset of Glot2000-c, restricted to languages that meet a minimum of 30,000 sentences. The corpus is about 600 GB in size and contains about 1.5 billion sentences. The data was obtained partly by crawling websites and partly by compiling existing datasets, so it may overlap with other SEACrowd datasets. For the same reason, licenses may differ across the underlying datasets, and some of them require registration and/or an access request. The work also includes several benchmarks and evaluation tasks on which Glot500-m is evaluated.
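The minimum-sentence criterion described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `select_languages`, the language codes, and the counts are all hypothetical, and treating a count of exactly 30,000 as included is an assumption.

```python
# Threshold stated in the description; everything else here is illustrative.
MIN_SENTENCES = 30_000


def select_languages(sentence_counts, min_sentences=MIN_SENTENCES):
    """Keep only languages whose sentence count meets the minimum.

    Assumption: a language with exactly `min_sentences` sentences is kept.
    """
    return {
        lang: count
        for lang, count in sentence_counts.items()
        if count >= min_sentences
    }


# Made-up counts, not real Glot2000-c statistics.
example_counts = {"lang_a": 1_200_000, "lang_b": 30_000, "lang_c": 12_500}

kept = select_languages(example_counts)
print(sorted(kept))
```

With these made-up counts, `lang_c` falls below the threshold and is dropped, while the other two are kept.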
Dataloader name: gloot500_c/gloot500_c.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?gloot500_c