Closes #442 | Add dataloader for ASR-MALCSC

zwenyu commented 6 months ago

Closes #442

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

elyanah-aco commented 5 months ago

Hi @zwenyu, thanks for the dataloader submission! The dataloader works fine for me, I just have some suggestions and comments.

zwenyu commented 5 months ago

@elyanah-aco Thanks for the comments. I've pushed changes according to the suggestions. For the returned text, I've removed the timestamps.

ljvmiranda921 commented 5 months ago

Hi @zwenyu, the data loader works for me, so no issues with the implementation! As for the timestamps, I think we should include them in the text. Without them, can we still map the audio to the transcription? If not, then maybe the timestamps are relevant 🤔. Thoughts? @elyanah-aco

elyanah-aco commented 5 months ago

@ljvmiranda921 Honestly either is okay for me - I listened to some of the audio files and there's just silence in periods without timestamps. Each utterance is separated by \n right now which I think is enough. But I defer to what option is most useful to researchers at the end of the day, haha

ljvmiranda921 commented 5 months ago

"Each utterance is separated by \n right now which I think is enough"

Ah I see, if that's the case then I think it's all good. LGTM!
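
For anyone following the thread, here is a minimal sketch of what "remove the timestamps and separate each utterance with \n" could look like. The transcript line format (`[start,end] text`), the sample strings, and the helper name `strip_timestamps` are assumptions made up for illustration; they are not taken from the dataset files or from the dataloader in this PR.

```python
import re

# ASSUMPTION: each transcript line looks like "[<start>,<end>] <text>".
# The real ASR-MALCSC files may use a different layout.
_TS_LINE = re.compile(r"^\[(\d+(?:\.\d+)?),(\d+(?:\.\d+)?)\]\s*(.*)$")


def strip_timestamps(raw_transcript: str) -> str:
    """Drop the leading timestamp of every line and join utterances with a newline."""
    utterances = []
    for line in raw_transcript.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        match = _TS_LINE.match(line)
        # Keep the line unchanged if it does not fit the assumed format.
        utterances.append(match.group(3) if match else line)
    return "\n".join(utterances)


# Hypothetical two-utterance transcript.
raw = "[0.00,2.50] selamat pagi\n[4.10,6.00] apa khabar"
text = strip_timestamps(raw)
```

With this approach the utterance boundaries survive in `text`, but the mapping from each utterance back to its position in the audio is lost unless the timestamps are stored somewhere else.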

holylovenia commented 4 months ago

@zwenyu Could you please add timestamp under metadata in the seacrowd_sptext schema as well? It should be fine as long as the unit test doesn't raise an error.

Added.
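
As a rough illustration of the request above, the sketch below keeps the text free of timestamps while carrying them alongside it under `metadata`. The transcript line format, the helper names, and the exact shape of the `timestamp` entry are assumptions for illustration only; the actual field layout is whatever the dataloader in this PR and the seacrowd_sptext schema define.

```python
import re
from typing import Dict, List, Tuple

# ASSUMPTION: each transcript line looks like "[<start>,<end>] <text>".
_TS_LINE = re.compile(r"^\[(\d+(?:\.\d+)?),(\d+(?:\.\d+)?)\]\s*(.*)$")


def split_text_and_timestamps(raw_transcript: str) -> Tuple[str, List[Dict[str, float]]]:
    """Return newline-joined utterances plus one start/end pair per utterance."""
    utterances: List[str] = []
    timestamps: List[Dict[str, float]] = []
    for line in raw_transcript.splitlines():
        match = _TS_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        start, end, text = match.groups()
        utterances.append(text)
        timestamps.append({"start": float(start), "end": float(end)})
    return "\n".join(utterances), timestamps


def build_example(example_id: str, raw_transcript: str) -> dict:
    """Hypothetical example dict: only the fields relevant to this discussion."""
    text, timestamps = split_text_and_timestamps(raw_transcript)
    return {"id": example_id, "text": text, "metadata": {"timestamp": timestamps}}


example = build_example("utt_0", "[0.00,2.50] selamat pagi\n[4.10,6.00] apa khabar")
```

Keeping one timestamp entry per utterance means the n-th line of `text` can still be matched to the n-th entry in `metadata`, which addresses the earlier concern about mapping audio to transcription.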

Thanks for the addition, @zwenyu!

@elyanah-aco, may I know if this PR is ready for merge?

SEACrowd / seacrowd-datahub

Closes #442 | Add dataloader for ASR-MALCSC #494