Closes #184 | Implement dataloader for CVSS

jensan-1 commented 2 years ago

Closes #184

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script nusantara/nusa_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _NUSANTARA_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one NusantaraConfig for the source schema and one for a nusantara schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_nusantara --path=nusantara/nusa_datasets/my_dataset/my_dataset.py.
[x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
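
For reference, the core of the _generate_examples() step can be sketched roughly as below. This is a minimal sketch and not the code in this PR: it assumes each CVSS .tsv row holds an audio filename followed by its translation text, and the function name generate_examples and the field names id, path and text are hypothetical, chosen only for illustration.

```python
import csv
import os


def generate_examples(tsv_path, audio_dir):
    """Yield (key, example) pairs from a two-column TSV file.

    Assumption: each row is <audio filename> TAB <translation text>.
    The field names below are hypothetical, not the NusaCrowd schema.
    """
    with open(tsv_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks in the text as literal characters.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for key, row in enumerate(reader):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            filename, text = row[0], row[1]
            yield key, {
                "id": filename,
                "path": os.path.join(audio_dir, filename),
                "text": text,
            }
```

In an actual dataloader this logic would live inside the builder's _generate_examples() method, with the paths coming from _split_generators().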

holylovenia commented 2 years ago

/test dataset=cvss

github-actions[bot] commented 2 years ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3046209125

holylovenia commented 2 years ago

Hi @jen-santoso, thank you for the dataloader. I'm a bit confused about how cvss_c and cvss_t differ, especially after I ran diff on the original data's tsv files and found no differences between them (which in turn means that the data loaded for cvss_c is exactly the same as for cvss_t). Could you please help me understand?

jensan-1 commented 2 years ago

Hi @holylovenia, thank you for the question. As you mentioned, cvss_c and cvss_t have exactly the same .tsv files and audio filenames. The difference is the speaker in the audio files: cvss_c is synthesized (all speech is in a single speaker's voice), while cvss_t has the voice transferred from the corresponding source speech (different speakers). In short, the audio files in cvss_c and cvss_t have the same names and textual content, but different speakers.
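
One way to confirm this locally is to compare the two extracted subsets file by file. The sketch below is illustrative only: compare_subsets and file_digest are hypothetical helper names, and it assumes each subset has been extracted into its own flat directory holding the .tsv files and the audio clips. If the two subsets differ only in the audio, the .tsv files should end up in the first list and the audio files in the second.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def compare_subsets(dir_c, dir_t):
    """Split filenames present in both directories by whether their bytes match."""
    names_c = {p.name for p in Path(dir_c).iterdir() if p.is_file()}
    names_t = {p.name for p in Path(dir_t).iterdir() if p.is_file()}
    same, different = [], []
    for name in sorted(names_c & names_t):
        if file_digest(Path(dir_c) / name) == file_digest(Path(dir_t) / name):
            same.append(name)
        else:
            different.append(name)
    return same, different
```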

Also, regarding the BuilderConfig not found in the test run, should I add a default cvss one instead of cvss_c and cvss_t?

jensan-1 commented 2 years ago

@holylovenia I just checked, and apparently the data provided in CVSS requires the original audio from Common Voice and the original text translations from CoVoST2. Ref: https://github.com/google-research-datasets/cvss Maybe we need to implement a speech-to-speech translation schema?

holylovenia commented 2 years ago

Hi @jen-santoso. Yeah, apparently a new schema and a new task are needed. I've made a PR for that at #262.

jensan-1 commented 2 years ago

test unit python -m tests.test_nusantara nusantara/nusa_datasets/cvss/cvss.py --subset_id=cvss_c

jensan-1 commented 2 years ago

/test dataset=cvss subset_id=cvss_c

github-actions[bot] commented 2 years ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3068096680

jensan-1 commented 2 years ago

/test dataset=cvss subset_id=cvss_c

github-actions[bot] commented 2 years ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3068300392

github-actions[bot] commented 2 years ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3068305806

jensan-1 commented 2 years ago

Hi @holylovenia, please check the commit. I have removed the other modifications. Also, I filled the name with original and translation; for the original, I note it as original together with the provided client_id. Please check.

SamuelCahyawijaya commented 2 years ago

@jen-santoso: Thanks for contributing, the PR looks great now! Approving this PR!
