SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for BRCC #519

Closed SamuelCahyawijaya closed 5 months ago

SamuelCahyawijaya commented 7 months ago

Dataloader name: brcc/brcc.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?brcc

Dataset: brcc
Description: The Bahasa Rojak Crawled Corpus (BRCC) is a code-mixed dataset for the Bahasa Rojak dialect in Malaysia. Passages are generated through data augmentation from English and Malay Wikipedia pages using a modified CoSDA-ML method. The quality of generated passages is evaluated by two native Malay speakers.
Subsets: -
Languages: zlm, eng, cmn
Tasks: Language Modeling
License: Unknown (unknown)
Homepage: https://data.depositar.io/dataset/brcc_and_sentibahasarojak/resource/8a558f64-98ff-4922-a751-0ce2ce8447bd
HF URL: -
Paper URL: https://aclanthology.org/2022.coling-1.389.pdf
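
For anyone unfamiliar with the augmentation mentioned in the description: CoSDA-ML-style code-mixing, as generally described, swaps some words in a sentence for dictionary translations in another language. The exact modification used for BRCC is not stated in this issue, so the snippet below is only a rough sketch of the general idea. The lexicon, the function name `code_mix`, and the `replace_prob` parameter are all hypothetical and are not taken from the paper or the dataset.

```python
import random

# Hypothetical toy English -> Malay lexicon, for illustration only.
# It is NOT the dictionary used to build BRCC.
TOY_LEXICON = {
    "i": "saya",
    "want": "mahu",
    "eat": "makan",
    "rice": "nasi",
}


def code_mix(tokens, lexicon, replace_prob, rng):
    """Return a copy of `tokens` where each token that has a lexicon entry
    is replaced by its translation with probability `replace_prob`.

    Tokens without a lexicon entry are always kept unchanged.
    """
    mixed = []
    for token in tokens:
        translation = lexicon.get(token.lower())
        if translation is not None and rng.random() < replace_prob:
            mixed.append(translation)
        else:
            mixed.append(token)
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same mix
    sentence = "I want to eat rice".split()
    print(" ".join(code_mix(sentence, TOY_LEXICON, 0.5, rng)))
```

With `replace_prob=1.0` every word found in the lexicon is swapped, and with `replace_prob=0.0` the sentence is returned unchanged; values in between give partially mixed sentences.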

elyanah-aco commented 6 months ago

self-assign