SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #528 | Add Dataloader national_speech_corpus_sg_imda #676

Closed mrqorib closed 4 months ago

mrqorib commented 4 months ago

Closes #528

Checkbox

INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/national_speech_corpus_sg_imda/national_speech_corpus_sg_imda.py', schema=None, subset_id=None, data_dir='/mnt/g/IMDA National Speech Corpus', use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/national_speech_corpus_sg_imda/national_speech_corpus_sg_imda.py
INFO:__main__:self.SUBSET_ID: national_speech_corpus_sg_imda
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: /mnt/g/IMDA National Speech Corpus
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.national_speech_corpus_sg_imda.national_speech_corpus_sg_imda
/mnt/d/1-primary/1-Research/seacrowd-datahub/env/lib/python3.10/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.SPEECH_RECOGNITION: 'ASR'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SPTEXT'}
INFO:__main__:schemas_to_check: {'SPTEXT'}
INFO:__main__:Checking load_dataset with config name national_speech_corpus_sg_imda_source
/mnt/d/1-primary/1-Research/seacrowd-datahub/env/lib/python3.10/site-packages/datasets/load.py:2516: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/mnt/d/1-primary/1-Research/seacrowd-datahub/env/lib/python3.10/site-packages/datasets/load.py:926: FutureWarning: The repository for national_speech_corpus_sg_imda contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/national_speech_corpus_sg_imda/national_speech_corpus_sg_imda.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Generating read_balanced_standing_mic split: 1049 examples [00:26, 39.13 examples/s]
Generating read_balanced_boundary_mic split: 784 examples [00:20, 38.87 examples/s]
Generating read_balanced_phone split: 651 examples [00:16, 38.53 examples/s]
Generating read_pertinent_standing_mic split: 815 examples [00:22, 36.41 examples/s]
Generating read_pertinent_boundary_mic split: 818 examples [00:21, 37.88 examples/s]
Generating read_pertinent_phone split: 860 examples [00:21, 40.84 examples/s]
Generating conversational_f2f_close_mic split: 8542 examples [01:52, 75.67 examples/s]
Generating conversational_f2f_boundary_mic split: 8542 examples [01:56, 73.45 examples/s]
Generating conversational_telephone_ivr split: 10567 examples [02:25, 72.86 examples/s]
Generating conversational_telephone_standing_mic split: 10567 examples [02:27, 71.48 examples/s]
INFO:__main__:Checking load_dataset with config name national_speech_corpus_sg_imda_seacrowd_sptext
Generating read_balanced_standing_mic split: 1049 examples [00:25, 40.64 examples/s]
Generating read_balanced_boundary_mic split: 784 examples [00:21, 37.16 examples/s]
Generating read_balanced_phone split: 651 examples [00:17, 37.89 examples/s]
Generating read_pertinent_standing_mic split: 815 examples [00:22, 36.84 examples/s]
Generating read_pertinent_boundary_mic split: 818 examples [00:20, 39.76 examples/s]
Generating read_pertinent_phone split: 860 examples [00:20, 41.21 examples/s]
Generating conversational_f2f_close_mic split: 8542 examples [02:02, 69.53 examples/s]
Generating conversational_f2f_boundary_mic split: 8542 examples [02:03, 68.94 examples/s]
Generating conversational_telephone_ivr split: 10567 examples [02:24, 73.00 examples/s]
Generating conversational_telephone_standing_mic split: 10567 examples [02:27, 71.49 examples/s]
INFO:__main__:Dataset sample [source]
{'id': '000020001', 'speaker_id': '0002', 'path': '~/.cache/nsc/PART1/DATA/CHANNEL0/WAVE/SPEAKER0002/SESSION0/000020001.WAV', 'audio': {'path': '~/.cache/nsc/PART1/DATA/CHANNEL0/WAVE/SPEAKER0002/SESSION0/000020001.WAV', 'array': array([3.96728516e-04, 6.10351562e-04, 5.18798828e-04, ...,
       2.13623047e-04, 9.15527344e-05, 1.83105469e-04]), 'sampling_rate': 16000}, 'text': 'I was so tired from work, I could not even bother to brush my teeth.'}
/mnt/d/1-primary/1-Research/seacrowd-datahub/env/lib/python3.10/site-packages/datasets/load.py:2516: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/mnt/d/1-primary/1-Research/seacrowd-datahub/env/lib/python3.10/site-packages/datasets/load.py:926: FutureWarning: The repository for national_speech_corpus_sg_imda contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/national_speech_corpus_sg_imda/national_speech_corpus_sg_imda.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
INFO:__main__:Dataset sample [seacrowd_sptext]
{'id': '000020001', 'path': '~/.cache/nsc/PART1/DATA/CHANNEL0/WAVE/SPEAKER0002/SESSION0/000020001.WAV', 'audio': {'path': '~/.cache/nsc/PART1/DATA/CHANNEL0/WAVE/SPEAKER0002/SESSION0/000020001.WAV', 'array': array([3.96728516e-04, 6.10351562e-04, 5.18798828e-04, ...,
       2.13623047e-04, 9.15527344e-05, 1.83105469e-04]), 'sampling_rate': 16000}, 'text': 'I was so tired from work, I could not even bother to brush my teeth.', 'speaker_id': '0002', 'metadata': {'speaker_age': None, 'speaker_gender': 'F'}}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 10567 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
read_balanced_standing_mic
==========
id: 1049
path: 1049
audio: 3147
text: 1049
speaker_id: 1049
metadata: 2098

read_balanced_boundary_mic
==========
id: 784
path: 784
audio: 2352
text: 784
speaker_id: 784
metadata: 1568

read_balanced_phone
==========
id: 651
path: 651
audio: 1953
text: 651
speaker_id: 651
metadata: 1302

read_pertinent_standing_mic
==========
id: 815
path: 815
audio: 2445
text: 815
speaker_id: 815
metadata: 1630

read_pertinent_boundary_mic
==========
id: 818
path: 818
audio: 2454
text: 818
speaker_id: 818
metadata: 1636

read_pertinent_phone
==========
id: 860
path: 860
audio: 2580
text: 860
speaker_id: 860
metadata: 1720

conversational_f2f_close_mic
==========
id: 8542
path: 8542
audio: 25626
text: 8542
speaker_id: 8542
metadata: 17084

conversational_f2f_boundary_mic
==========
id: 8542
path: 8542
audio: 25626
text: 8542
speaker_id: 8542
metadata: 17084

conversational_telephone_ivr
==========
id: 10567
path: 10567
audio: 31701
text: 10567
speaker_id: 10567
metadata: 21134

conversational_telephone_standing_mic
==========
id: 10567
path: 10567
audio: 31701
text: 10567
speaker_id: 10567
metadata: 21134

.
----------------------------------------------------------------------
Ran 1 test in 1879.511s

OK
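The "global ID uniqueness" check in the log above can be illustrated with a tiny sketch. The utterance IDs below are made up (modeled on the sample record's ID format), not taken from the real corpus; the same utterance captured on a second microphone collapses to a single entry:

```shell
# Count distinct utterance IDs; a duplicate ID (same utterance recorded on a
# different microphone) is counted only once.
printf '%s\n' 000020001 000020002 000020001 | sort -u | wc -l   # 2 distinct IDs
```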
mrqorib commented 4 months ago

I apologize that the medisco commits were also dragged into this PR; I created this branch from medisco's branch. Please check only the last two commits.

I have tested the dataloader manually, but please note that I only tested it on a subset of the data, as I kept failing to download the full 1.2 TB. The subset is representative of the dataset's full directory structure.

ljvmiranda921 commented 4 months ago

Hi @mrqorib! Saw your comment regarding the medisco commits. Is it possible to separate them? It might require some surgery (the simplest approach would be git rebase --onto X Y), but at least it makes it easier in the long term to track which PR does what! Thank you so much!
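For reference, the rebase approach can be sketched in a throwaway repo. Everything below is illustrative (branch names, file names, commit messages are all stand-ins), assuming the NSC branch was accidentally cut from the medisco branch instead of main:

```shell
set -e
# Build a throwaway repo reproducing the situation: a feature branch
# accidentally cut from another feature branch instead of from main.
cd "$(mktemp -d)" && git init -q repo && cd repo
git config user.email dev@example.com && git config user.name dev
echo base > base.txt && git add base.txt && git commit -qm "base" && git branch -M main
git checkout -qb medisco
echo medisco > medisco.py && git add medisco.py && git commit -qm "medisco dataloader"
git checkout -qb nsc                       # mistake: branched off medisco
echo nsc > nsc.py && git add nsc.py && git commit -qm "add NSC dataloader"
echo more >> nsc.py && git commit -qam "fix NSC dataloader"
# Replay only the commits unique to 'nsc' (i.e. medisco..nsc) onto main:
git rebase -q --onto main medisco nsc
git log --oneline main..nsc                # now shows just the two NSC commits
```

After the rebase, the branch contains only its own commits; the medisco file no longer appears in its tree.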

holylovenia commented 4 months ago

Hi @mrqorib, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

@ljvmiranda921 @muhammadravi251001, since @mrqorib provided the unit test result, could you please do the review based on the implementation code? 🤔 Also feel free to ask @mrqorib to generate any outputs that would help your review process for this local dataloader.

holylovenia commented 4 months ago

Hi @muhammadravi251001, could we wait until @mrqorib removes the changes made to the medisco dataloader?

muhammadravi251001 commented 4 months ago

Hi @muhammadravi251001, could we wait until @mrqorib removes the changes made to the medisco dataloader?

Sure, it's up to you, kak, since there are fewer than 31 hours remaining. For now, I can't remove the medisco dataloader myself.

muhammadravi251001 commented 4 months ago

BTW @holylovenia, can we just delete both medisco files in the Files Changed tab here? Is that alright?

mrqorib commented 4 months ago

Hi all, thanks for your help with reviewing the PR. Sorry for the slow response; I was a bit busy. I can try to tidy up the medisco mess tonight. @holylovenia, please let me know if just deleting the medisco files would be fine in case the rebase method suggested by @ljvmiranda921 fails.
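The "just delete the files" fallback amounts to removing the stray files in a follow-up commit. A self-contained sketch in a throwaway repo (the medisco path below is a guess at the repo layout, not taken from this PR):

```shell
set -e
# Throwaway repo demonstrating the delete-the-stray-files fallback.
cd "$(mktemp -d)" && git init -q repo && cd repo
git config user.email dev@example.com && git config user.name dev
mkdir -p seacrowd/sea_datasets/medisco seacrowd/sea_datasets/national_speech_corpus_sg_imda
echo stray > seacrowd/sea_datasets/medisco/medisco.py
echo loader > seacrowd/sea_datasets/national_speech_corpus_sg_imda/national_speech_corpus_sg_imda.py
git add -A && git commit -qm "PR branch with stray medisco files"
# Drop the stray directory and record the deletion as a new commit:
git rm -rq seacrowd/sea_datasets/medisco
git commit -qm "Remove stray medisco dataloader files"
```

Unlike the rebase, this keeps the unwanted commits in the branch history; it only makes the final diff clean, which is usually enough for a squash-and-merge.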

muhammadravi251001 commented 4 months ago

Alright, after deleting the medisco changes via the Files Changed tab, everything looks good now. We can squash and merge this PR. Thanks for the contribution, @mrqorib!

muhammadravi251001 commented 4 months ago

Hi all, thanks for your help with reviewing the PR. Sorry for the slow response; I was a bit busy. I can try to tidy up the medisco mess tonight. @holylovenia, please let me know if just deleting the medisco files would be fine in case the rebase method suggested by @ljvmiranda921 fails.

I think just deleting the medisco files works for now, since this PR no longer contains any other dataloader implementation, which is the main point.