SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #356 | Implement dataloader for CodeSwitch-Reddit #451

Status: Closed (closed by elyanah-aco 7 months ago)

elyanah-aco commented 7 months ago

Closes #356.

Notes:


sabilmakbar commented 7 months ago

Hi @elyanah-aco, would you like to create a separate PR to add the config for code-switching? Thanks!

sabilmakbar commented 7 months ago

For clarification: to run this dataloader through the SEACrowd test cases, we have to pass --schema in addition to --subset_id. I tested it with these commands:

python -m tests.test_seacrowd seacrowd/sea_datasets/codeswitch_reddit/codeswitch_reddit.py --subset_id codeswitch_reddit_cs --schema TEXT_MULTI
python -m tests.test_seacrowd seacrowd/sea_datasets/codeswitch_reddit/codeswitch_reddit.py --subset_id codeswitch_reddit_eng_monolingual --schema SSP
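If it helps to keep the subset/schema pairs together, the two invocations above could be generated from a small script. This is only a sketch: the helper name build_test_command and the CASES list are hypothetical and not part of the SEACrowd repository, and the module path, dataloader path, and flag values are copied from the commands above. It prints the commands rather than running them.

```python
import shlex
import sys

# Dataloader path copied from the commands quoted in this comment.
DATALOADER = "seacrowd/sea_datasets/codeswitch_reddit/codeswitch_reddit.py"

# (subset_id, schema) pairs taken from the two commands in this comment.
CASES = [
    ("codeswitch_reddit_cs", "TEXT_MULTI"),
    ("codeswitch_reddit_eng_monolingual", "SSP"),
]


def build_test_command(subset_id, schema, python=sys.executable):
    """Hypothetical helper: build the argv list for one tests.test_seacrowd run."""
    return [
        python, "-m", "tests.test_seacrowd", DATALOADER,
        "--subset_id", subset_id,
        "--schema", schema,
    ]


if __name__ == "__main__":
    # Print each command so it can be reviewed or pasted into a shell.
    for subset_id, schema in CASES:
        print(shlex.join(build_test_command(subset_id, schema)))
```

To actually run the tests, each list could be passed to subprocess.run(cmd, check=True) from the repository root, assuming the repository and its test dependencies are installed.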