SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for CCMatrix #514

Closed. SamuelCahyawijaya closed this issue 3 months ago.

SamuelCahyawijaya commented 5 months ago

Dataloader name: ccmatrix/ccmatrix.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?ccmatrix

Dataset: ccmatrix
Description: The CCMatrix dataset was collected from web crawls and released by Meta. It was constructed with margin-based bitext mining, which can be applied to monolingual corpora of billions of sentences to produce high-quality aligned translation data.
Subsets: -
Languages: jav, eng, vie, ind, tgl, mya, zlm
Tasks: Language Modeling, Machine Translation
License: BSD license family (bsd)
Homepage: https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix
HF URL: https://huggingface.co/datasets/yhavinga/ccmatrix
Paper URL: https://aclanthology.org/2021.acl-long.507/
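
For anyone picking this up who has not seen margin-based bitext mining before, here is a rough sketch of the scoring idea mentioned in the description. It assumes the ratio form of the margin criterion (cosine similarity of a candidate pair divided by the average similarity of each sentence to its k nearest neighbours in the other language). The function names and the hand-made 2-D vectors are hypothetical and only for illustration; this is not the CCMatrix pipeline itself, which works on real sentence embeddings at a much larger scale.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_top_k(query, candidates, k):
    """Average cosine similarity of `query` to its k most similar candidates."""
    sims = sorted((cosine(query, c) for c in candidates), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def margin_score(x, y, src_pool, tgt_pool, k=2):
    """Ratio margin (assumed variant): cos(x, y) over the mean neighbourhood similarity.

    x is a source-side vector, y a target-side vector. The neighbourhood of x
    is searched in tgt_pool and the neighbourhood of y in src_pool.
    """
    denom = (mean_top_k(x, tgt_pool, k) + mean_top_k(y, src_pool, k)) / 2
    return cosine(x, y) / denom


# Toy "embeddings" (hypothetical): src_pool[i] is meant to align with tgt_pool[i].
src_pool = [[1.0, 0.0], [0.0, 1.0]]
tgt_pool = [[0.9, 0.1], [0.1, 0.9]]

aligned = margin_score(src_pool[0], tgt_pool[0], src_pool, tgt_pool)
mismatched = margin_score(src_pool[0], tgt_pool[1], src_pool, tgt_pool)
print(f"aligned={aligned:.3f} mismatched={mismatched:.3f}")
```

A pair is kept when its score clears a chosen threshold. Scoring relative to the neighbourhood, rather than using raw cosine similarity, penalises sentences that are similar to everything.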
patrickamadeus commented 5 months ago

self-assign