IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Closes #184 | Implement dataloader for CVSS #250

Closed jensan-1 closed 2 years ago

jensan-1 commented 2 years ago

Closes #184

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

holylovenia commented 2 years ago

/test dataset=cvss

github-actions[bot] commented 2 years ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3046209125

holylovenia commented 2 years ago

Hi @jen-santoso, thank you for the dataloader. Actually, I'm a bit confused about how cvss_c and cvss_t differ, even more so after I ran diff on the original data's tsv files and found no differences between them (which in turn means the data loaded for cvss_c is exactly the same as for cvss_t). Could you please help me understand?

jensan-1 commented 2 years ago

Hi @holylovenia, thank you for the question. As you mentioned, cvss_c and cvss_t have exactly the same .tsv files and audio filenames. The difference is the speaker in the audio files: cvss_c is synthesized, with all speech in a single speaker's voice, while cvss_t has the voice transferred from the corresponding source speech, so the speakers differ. In short, the audio files in cvss_c and cvss_t have the same names and textual content, but different speakers.
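One way to confirm this locally is to hash the files of both subsets and see which ones match byte for byte. The following is only a minimal sketch: `file_digest` and `compare_subsets` are hypothetical helper names, and it assumes each subset has been extracted into its own flat directory, which may not match the real archive layout.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_subsets(dir_a: Path, dir_b: Path) -> dict:
    """Classify files of two flat directories by name and content.

    Assumption: both directories are flat (no nested folders).
    """
    names_a = {p.name for p in dir_a.iterdir() if p.is_file()}
    names_b = {p.name for p in dir_b.iterdir() if p.is_file()}
    shared = sorted(names_a & names_b)

    # Same name and same bytes (expected for the .tsv files).
    identical = [
        n for n in shared if file_digest(dir_a / n) == file_digest(dir_b / n)
    ]
    # Same name but different bytes (expected for the audio files).
    different = [n for n in shared if n not in identical]

    return {
        "identical": identical,
        "different": different,
        "only_in_one": sorted(names_a ^ names_b),
    }
```

If the explanation above holds, the .tsv files should land in "identical" and the audio files in "different".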

Also, regarding the BuilderConfig not found in the test run, should I add a default cvss one instead of cvss_c and cvss_t?

jensan-1 commented 2 years ago

@holylovenia I just checked, and apparently the data provided in CVSS also requires the original audio from Common Voice and the original text translation in CoVoST2. Ref: https://github.com/google-research-datasets/cvss Maybe we need to implement a speech-to-speech translation schema?
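To illustrate what I mean, here is a rough sketch of pairing each CVSS row with its source clip. Everything in it is an assumption for illustration: the `S2SExample` fields are hypothetical and not an existing schema in this repo, both inputs are simplified to two tab-separated columns (file name, then text), and the target clip is assumed to share the source file name, although the real extension may differ.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class S2SExample:
    # Hypothetical record layout, for illustration only.
    id: str
    source_audio: str
    source_text: str
    target_audio: str
    target_text: str


def read_tsv(text: str, columns: list) -> dict:
    """Parse tab-separated text into {first column: row dict}."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: dict(zip(columns, row)) for row in reader if row}


def join_examples(cvss_tsv: str, source_tsv: str, source_dir: str, target_dir: str) -> list:
    """Pair each CVSS row with the source row that has the same file name."""
    targets = read_tsv(cvss_tsv, ["file", "translation"])
    sources = read_tsv(source_tsv, ["file", "sentence"])

    examples = []
    for name, target in targets.items():
        source = sources.get(name)
        if source is None:
            continue  # no matching source clip, so the pair is skipped
        examples.append(
            S2SExample(
                id=name,
                source_audio=f"{source_dir}/{name}",
                source_text=source["sentence"],
                # Assumption: target clip reuses the source file name.
                target_audio=f"{target_dir}/{name}",
                target_text=target["translation"],
            )
        )
    return examples
```

The point is only that one example needs fields for both sides (audio and text), which the current schemas do not seem to cover.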

holylovenia commented 2 years ago

Hi @jen-santoso. Yeah, apparently a new schema and a new task are needed. I've made a PR for that at #262.

jensan-1 commented 2 years ago

Unit test:

python -m tests.test_nusantara nusantara/nusa_datasets/cvss/cvss.py --subset_id=cvss_c

jensan-1 commented 2 years ago

/test dataset=cvss subset_id=cvss_c

github-actions[bot] commented 2 years ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3068096680

jensan-1 commented 2 years ago

/test dataset=cvss subset_id=cvss_c

github-actions[bot] commented 2 years ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3068300392

github-actions[bot] commented 2 years ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3068305806

jensan-1 commented 2 years ago

Hi @holylovenia, please check the commit. I have removed the other modifications. Also, I filled in the name with original and translation, though for the original I noted it as original along with the provided client_id. Please check.

SamuelCahyawijaya commented 2 years ago

@jen-santoso: Thanks for contributing, the PR looks great now! Approving this PR!