The Bahasa Rojak Crawled Corpus (BRCC) is a code-mixed dataset for the Bahasa Rojak dialect spoken in Malaysia. Passages are generated by augmenting English and Malay Wikipedia pages with a modified CoSDA-ML method. The quality of the generated passages was evaluated by two native Malay speakers.
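To give a rough idea of the kind of augmentation described above, here is a minimal sketch of dictionary-based code-switching in the spirit of CoSDA-ML, which randomly swaps tokens for translations from a bilingual lexicon. This is not the BRCC authors' modified method, whose details are not given here. The function `code_switch`, the `ratio` parameter, and the three-entry `TOY_LEXICON` are all hypothetical and exist only for illustration.

```python
import random

# Hypothetical toy English -> Malay lexicon, for illustration only.
TOY_LEXICON = {
    "i": ["saya"],
    "eat": ["makan"],
    "rice": ["nasi"],
}


def code_switch(sentence, lexicon, ratio=0.5, seed=None):
    """Replace each whitespace token found in `lexicon` with probability `ratio`.

    Lookup is case-insensitive; tokens without an entry are kept unchanged.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible per seed
    mixed = []
    for token in sentence.split():
        translations = lexicon.get(token.lower())
        if translations and rng.random() < ratio:
            mixed.append(rng.choice(translations))
        else:
            mixed.append(token)
    return " ".join(mixed)


if __name__ == "__main__":
    # Output varies with the seed; some, all, or none of the tokens may be swapped.
    print(code_switch("I eat rice every day", TOY_LEXICON, ratio=0.5, seed=0))
```

With `ratio=1.0` every token that has a lexicon entry is replaced, and with `ratio=0.0` the sentence is returned unchanged; intermediate values yield mixed-language output.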
Dataloader name: brcc/brcc.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?brcc