SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #440 | Add Dataloader ASR-IBSC: AN IBAN SPEECH CORPORA #594

Closed · akhdanfadh closed this 1 month ago

akhdanfadh commented 3 months ago

Closes #440


akhdanfadh commented 2 months ago

> this dataloader hasn't passed `python -m tests.test_seacrowd seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py` yet. Would you mind fixing it first until it can pass the unit test?

It is working on my end without any changes. Could you share your output?

Test result:

```
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py', schema=None, subset_id=None, data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py
INFO:__main__:self.SUBSET_ID: asr_ibsc
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.asr_ibsc.asr_ibsc
INFO:__main__:Found _SUPPORTED_TASKS=[]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SPTEXT'}
INFO:__main__:schemas_to_check: {'SPTEXT'}
INFO:__main__:Checking load_dataset with config name asr_ibsc_source
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:2508: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=' instead.
  warnings.warn(
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:923: FutureWarning: The repository for asr_ibsc contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
INFO:__main__:Checking load_dataset with config name asr_ibsc_seacrowd_sptext
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:2508: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=' instead.
  warnings.warn(
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:923: FutureWarning: The repository for asr_ibsc contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
INFO:__main__:Dataset sample [source] {'audio': {'path': 'ibf_001_001.wav', 'array': array([ 5.72814941e-01, 5.49011230e-01, -1.82495117e-02, ..., -2.31628418e-02, -1.26342773e-02, -3.05175781e-05]), 'sampling_rate': 16000}, 'transcription': 'pukul sepuluh malam'}
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:2508: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=' instead.
  warnings.warn(
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:923: FutureWarning: The repository for asr_ibsc contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
INFO:__main__:Dataset sample [seacrowd_sptext] {'id': '0', 'path': None, 'audio': {'path': 'ibf_001_001.wav', 'array': array([ 5.72814941e-01, 5.49011230e-01, -1.82495117e-02, ..., -2.31628418e-02, -1.26342773e-02, -3.05175781e-05]), 'sampling_rate': 16000}, 'text': 'pukul sepuluh malam', 'speaker_id': None, 'metadata': None}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 3132 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 3132
audio: 9396
text: 3132
.
----------------------------------------------------------------------
Ran 1 test in 6.545s

OK
```
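(Editor's note: the FutureWarnings above come from `datasets` itself, not from the dataloader. A minimal sketch of loading the source config directly while silencing the custom-code warning; the config name and field names are taken from the log above, and the script path assumes you run from the repo root.)

```python
from datasets import load_dataset

# Load the source schema straight from the local dataloader script.
# trust_remote_code=True pre-empts the "custom code" FutureWarning seen in
# the log; the config name is the one printed by the test harness.
dset = load_dataset(
    "seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py",
    name="asr_ibsc_source",
    trust_remote_code=True,
)
print(dset["train"][0]["transcription"])  # e.g. 'pukul sepuluh malam'
```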
akhdanfadh commented 2 months ago

Well, I don't know why those comments result in an error on your end, but not on mine. I've uncommented the line there.

@holylovenia @sabilmakbar @faridlazuarda

holylovenia commented 2 months ago

A friendly reminder for @akhdanfadh to check on this PR. 👀

akhdanfadh commented 2 months ago

> In case you missed it, it seems that this dataset has a test split aside from the train split. In case you didn't, is there a reason to exclude the test split?

@holylovenia There is only the train set in the HF datacard, which is why I excluded it. Since I think the author on HF is not the original author, I will reimplement the dataloader with the GitHub version. Let me work on it this weekend.

Probably better to remove the HF URL in our datasheet, no?
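(Editor's note: for reference, a minimal sketch, not the merged implementation, of how a GitHub-based `datasets` builder could expose both the train and test splits. The URLs, file layout, and class name below are placeholders, not the actual ASR-IBSC sources.)

```python
import os

import datasets

# Placeholder URLs: the real dataloader points at the dataset's GitHub
# release; these names are illustrative only.
_HYPOTHETICAL_URLS = {
    "train": "https://example.com/iban-asr/train.tar.gz",
    "test": "https://example.com/iban-asr/test.tar.gz",
}


class AsrIbscSketch(datasets.GeneratorBasedBuilder):
    """Illustrative builder showing how both splits could be exposed."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "audio": datasets.Audio(sampling_rate=16_000),
                    "transcription": datasets.Value("string"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        # Download/extract each archive once, then hand one directory per split
        # to _generate_examples.
        data_dirs = dl_manager.download_and_extract(_HYPOTHETICAL_URLS)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": data_dirs["train"]},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={"data_dir": data_dirs["test"]},
            ),
        ]

    def _generate_examples(self, data_dir):
        # Assumes a transcript file named "text" with lines of the form
        # "<utterance_id> <transcription>" and one <utterance_id>.wav per
        # utterance; the real corpus layout may differ.
        with open(os.path.join(data_dir, "text"), encoding="utf-8") as f:
            for idx, line in enumerate(f):
                utt_id, transcription = line.strip().split(maxsplit=1)
                yield idx, {
                    "audio": os.path.join(data_dir, f"{utt_id}.wav"),
                    "transcription": transcription,
                }
```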

akhdanfadh commented 1 month ago

Done @holylovenia @faridlazuarda. Please re-review, because this is a totally different implementation based on the GitHub data (instead of HF).
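(Editor's note: a quick hedged check for reviewers that the reimplemented loader exposes both splits, assuming the config name from the earlier test log is unchanged.)

```python
from datasets import load_dataset

# Load the SeaCrowd sptext schema from the local dataloader script and list
# the splits it provides; with the GitHub-based data, both train and test
# should appear if the reimplementation includes the extra split.
dset = load_dataset(
    "seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py",
    name="asr_ibsc_seacrowd_sptext",
    trust_remote_code=True,
)
print({split: len(dset[split]) for split in dset})
```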