Closes #278 | Create dataset loader for INDspeech_DIGIT_CDSR

IvanHalimP commented 1 year ago

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Closes #278

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script nusantara/nusa_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _NUSANTARA_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one NusantaraConfig for the source schema and one for a nusantara schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_nusantara --path=nusantara/nusa_datasets/my_dataset/my_dataset.py.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

I noticed two things while working on this:

The test data is not unique: across all 4 test sets there are 23 duplicate entries.
The start '|S|' and end '|E|' tokens are not removed from the text. Let me know if they should be removed.
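
For reference, here is a minimal sketch of how both points could be checked or handled inside a dataloader. It is not the code in this PR. The `id` and `text` field names, the helper names, and the idea of keying duplicates on `id` are all assumptions made for illustration; the actual fields depend on the source files.

```python
from collections import Counter

# Boundary tokens as they are written in this PR description.
START_TOKEN = "|S|"
END_TOKEN = "|E|"


def strip_boundary_tokens(text):
    """Drop a leading |S| and a trailing |E|, plus surrounding whitespace."""
    text = text.strip()
    if text.startswith(START_TOKEN):
        text = text[len(START_TOKEN):]
    if text.endswith(END_TOKEN):
        text = text[:-len(END_TOKEN)]
    return text.strip()


def find_duplicate_ids(examples):
    """Return the sorted ids that occur more than once.

    Assumes each example is a dict with a hypothetical "id" key.
    """
    counts = Counter(example["id"] for example in examples)
    return sorted(key for key, n in counts.items() if n > 1)


def deduplicate(examples):
    """Keep the first occurrence of each id and preserve the original order."""
    seen = set()
    unique = []
    for example in examples:
        if example["id"] not in seen:
            seen.add(example["id"])
            unique.append(example)
    return unique


if __name__ == "__main__":
    # Made-up rows, only to exercise the helpers above.
    rows = [
        {"id": "utt_001", "text": "|S| satu dua tiga |E|"},
        {"id": "utt_002", "text": "|S| empat lima |E|"},
        {"id": "utt_001", "text": "|S| satu dua tiga |E|"},
    ]
    print(find_duplicate_ids(rows))
    for row in deduplicate(rows):
        print(row["id"], strip_boundary_tokens(row["text"]))
```

Keeping the first occurrence is only one possible policy. If the source schema is meant to mirror the original files exactly, it may be preferable to leave the duplicates and the tokens in place there and apply this cleanup only in the nusantara schema.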

That's all

IvanHalimP commented 1 year ago

/test dataset=indspeech_digit_cdsr

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3137018143

IvanHalimP commented 1 year ago

/test dataset=indspeech_digit_cdsr

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3163684591
