Closes #356 | Implement dataloader for CodeSwitch-Reddit

elyanah-aco commented 7 months ago

Closes #356.

Notes:

I added a new task, CODE_SWITCHING_IDENTIFICATION, that uses the seacrowd_text_multi schema and takes language codes as labels.
The dataset has two mutually exclusive subsets, cs and eng_monolingual. The cs subset uses the seacrowd_text_multi schema and the eng_monolingual subset uses seacrowd_ssp.
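
For reference, the subset-to-schema pairing described in these notes can be sketched as below. This is only an illustration, not the dataloader code in this PR: the names SUBSET_TO_SCHEMA and config_names are hypothetical, and the `<dataset>_<subset>_source` / `<dataset>_<subset>_seacrowd_<schema>` naming pattern is an assumption about how the config names are formed.

```python
from typing import List

# Dataset name as used in the dataloader path for this PR.
_DATASETNAME = "codeswitch_reddit"

# Hypothetical mapping; subset and schema names are taken from the notes above.
SUBSET_TO_SCHEMA = {
    "cs": "text_multi",
    "eng_monolingual": "ssp",
}


def config_names(subset: str) -> List[str]:
    """Return the assumed source and seacrowd config names for one subset."""
    schema = SUBSET_TO_SCHEMA[subset]  # raises KeyError for an unknown subset
    prefix = f"{_DATASETNAME}_{subset}"
    return [f"{prefix}_source", f"{prefix}_seacrowd_{schema}"]


if __name__ == "__main__":
    for subset in SUBSET_TO_SCHEMA:
        print(config_names(subset))
```

Because each subset maps to exactly one seacrowd schema, a config that pairs cs with ssp (or eng_monolingual with text_multi) should not exist under this sketch.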

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

sabilmakbar commented 7 months ago

Hi @elyanah-aco, would you like to open a separate PR for adding the code-switching task config? Thanks!

sabilmakbar commented 7 months ago

For clarification: to run this dataloader through the SEACrowd test cases, we have to pass --schema in addition to --subset_id. Tested with:

python -m tests.test_seacrowd seacrowd/sea_datasets/codeswitch_reddit/codeswitch_reddit.py --subset_id codeswitch_reddit_cs --schema TEXT_MULTI
python -m tests.test_seacrowd seacrowd/sea_datasets/codeswitch_reddit/codeswitch_reddit.py --subset_id codeswitch_reddit_eng_monolingual --schema SSP
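
If it helps, the two invocations above can be generated from a single subset-to-schema table, so both subsets are always tested with the matching schema. This is a rough sketch only: SUBSETS and build_test_command are hypothetical names, and the script just prints the commands rather than running them.

```python
import shlex
from typing import List

# Dataloader path, subset ids, and schema names copied from the commands above.
SCRIPT = "seacrowd/sea_datasets/codeswitch_reddit/codeswitch_reddit.py"
SUBSETS = {
    "codeswitch_reddit_cs": "TEXT_MULTI",
    "codeswitch_reddit_eng_monolingual": "SSP",
}


def build_test_command(subset_id: str, schema: str) -> List[str]:
    """Build the argument list for one test run of the dataloader."""
    return [
        "python", "-m", "tests.test_seacrowd", SCRIPT,
        "--subset_id", subset_id,
        "--schema", schema,
    ]


if __name__ == "__main__":
    for subset_id, schema in SUBSETS.items():
        # Print a shell-safe command line for each subset.
        print(shlex.join(build_test_command(subset_id, schema)))
```

The returned lists could also be passed to subprocess.run from the repository root, assuming the test module is importable from there.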

SEACrowd / seacrowd-datahub

Closes #356 | Implement dataloader for CodeSwitch-Reddit #451
