Closed jensan-1 closed 2 years ago
/test dataset=cvss
Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3046209125
Hi @jen-santoso, thank you for the dataloader. Actually, I'm a bit confused about how cvss_c and cvss_t differ, even more so after I ran diff between the original data's tsv files and found no differences between them (which in turn means that the data loaded for cvss_c is exactly the same as for cvss_t). Could you please help me understand?
Hi @holylovenia, thank you for the question.
As you mentioned, cvss_c and cvss_t have exactly the same .tsv files and audio filenames. The difference is the speaker in the audio files: cvss_c is synthesized (all speech is in a single speaker's voice), while cvss_t has the voice transferred from the corresponding source speech (different speakers). In short, the audio files in cvss_c and cvss_t have the same names and textual content, but different speakers.
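A minimal sketch of how that difference could be checked locally: hash every file in both subset directories and report which shared files are byte-identical and which are not. The directory layout, file names, and helper names below are made up for illustration and are not taken from the CVSS release.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_subsets(dir_c: Path, dir_t: Path) -> dict:
    """Compare two directories file by file, using relative paths as keys."""
    names_c = {p.relative_to(dir_c).as_posix() for p in dir_c.rglob("*") if p.is_file()}
    names_t = {p.relative_to(dir_t).as_posix() for p in dir_t.rglob("*") if p.is_file()}
    shared = sorted(names_c & names_t)
    same = [n for n in shared if file_digest(dir_c / n) == file_digest(dir_t / n)]
    different = [n for n in shared if n not in same]
    return {"same": same, "different": different, "only_in_one": sorted(names_c ^ names_t)}


def build_toy_layout(root: Path):
    """Create two tiny fake subsets (hypothetical layout, placeholder bytes)."""
    dirs = []
    for subset, audio in (("cvss_c", b"single-speaker audio"), ("cvss_t", b"voice-transferred audio")):
        subset_dir = root / subset
        (subset_dir / "clips").mkdir(parents=True)
        # Same transcript file in both subsets.
        (subset_dir / "train.tsv").write_text("clip_0001.wav\ta placeholder translation\n", encoding="utf-8")
        # Same audio filename, different audio bytes.
        (subset_dir / "clips" / "clip_0001.wav").write_bytes(audio)
        dirs.append(subset_dir)
    return dirs[0], dirs[1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dir_c, dir_t = build_toy_layout(Path(tmp))
        print(compare_subsets(dir_c, dir_t))
```

On the toy layout, the .tsv file lands in "same" and the audio clip in "different", which is the situation described above: a diff of the .tsv files alone cannot tell the two subsets apart.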
Also, regarding the BuilderConfig not found in the test run, should I add a default cvss one instead of cvss_c and cvss_t?
@holylovenia I just checked that apparently, the data provided in the CVSS would require original audio from Common Voice and original text translation in CoVoST2. Ref: https://github.com/google-research-datasets/cvss
Maybe we need to implement a speech-to-speech translation schema?
Hi @jen-santoso. Yeah, apparently a new schema and a new task are needed. I've made a PR for that at #262.
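For illustration only, here is a rough sketch of what an example in a speech-to-speech style schema might look like for the two subsets. Every field name, the two-column .tsv layout, and the helper below are assumptions made for this sketch; the schema actually added in #262 may look quite different.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import Iterator, Tuple


@dataclass
class S2SExample:
    # All field names are hypothetical, not the schema from #262.
    id: str
    target_audio_path: str  # synthesized (cvss_c) or voice-transferred (cvss_t) audio
    target_text: str        # translation text read from the subset's .tsv
    subset: str             # "cvss_c" or "cvss_t"


def generate_examples(tsv_text: str, audio_dir: str, subset: str) -> Iterator[Tuple[int, dict]]:
    """Yield (key, example) pairs from tab-separated text.

    Assumes each non-empty row is "<audio filename>\t<translation>".
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    key = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        filename, translation = row
        example = S2SExample(
            id=str(key),
            target_audio_path=f"{audio_dir}/{filename}",
            target_text=translation,
            subset=subset,
        )
        yield key, asdict(example)
        key += 1


if __name__ == "__main__":
    tsv = "clip_0001.wav\ta placeholder translation\n"
    for subset in ("cvss_c", "cvss_t"):
        for key, example in generate_examples(tsv, f"/data/{subset}/clips", subset):
            print(key, example)
```

With identical .tsv input, the two subsets produce the same text and differ only in where the audio comes from, so a schema that carries the audio is needed to tell them apart.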
test unit
python -m tests.test_nusantara nusantara/nusa_datasets/cvss/cvss.py --subset_id=cvss_c
/test dataset=cvss subset_id=cvss_c
Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3068096680
/test dataset=cvss subset_id=cvss_c
Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3068300392
Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3068305806
Hi @holylovenia, please check the commit. I have removed the other modifications.
Also, I filled in the name with original and translation, though for the original I noted it as original with the client_id provided. Please check.
@jen-santoso : Thanks for contributing, the PR looks great now! Approving this PR!
Closes #184
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
Checkbox
- Create the dataloader script nusantara/nusa_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
- Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _NUSANTARA_VERSION variables.
- Implement _info(), _split_generators() and _generate_examples() in dataloader script.
- Make sure that the BUILDER_CONFIGS class attribute is a list with at least one NusantaraConfig for the source schema and one for a nusantara schema.
- Confirm the dataloader script works with the datasets.load_dataset function.
- Confirm that the dataloader script passes the test suite run with python -m tests.test_nusantara --path=nusantara/nusa_datasets/my_dataset/my_dataset.py.
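As a rough pre-check before running the test suite, the variable and method items in the checklist could be verified statically. The sketch below parses a dataloader script with Python's ast module and lists whichever of the names from the checklist are not defined. The helper name missing_items is made up for this example, and it does not replace tests.test_nusantara.

```python
import ast

# Names taken from the checklist above.
REQUIRED_VARIABLES = (
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_NUSANTARA_VERSION",
)
REQUIRED_METHODS = ("_info", "_split_generators", "_generate_examples")


def missing_items(source: str) -> list:
    """Return checklist names that the given script source does not define.

    Looks only at module-level assignments and at methods defined directly
    inside top-level classes; it does not execute the script.
    """
    tree = ast.parse(source)
    assigned, methods = set(), set()
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    assigned.add(target.id)
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            assigned.add(node.target.id)
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    methods.add(item.name)
    missing = [name for name in REQUIRED_VARIABLES if name not in assigned]
    missing += [f"{name}()" for name in REQUIRED_METHODS if name not in methods]
    return missing


if __name__ == "__main__":
    # A deliberately incomplete toy script, not a real dataloader.
    sample = (
        '_CITATION = ""\n'
        '_DATASETNAME = "my_dataset"\n'
        "class MyDataset:\n"
        "    def _info(self):\n"
        "        pass\n"
    )
    print(missing_items(sample))
```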