SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #440 | Add Dataloader ASR-IBSC: AN IBAN SPEECH CORPORA #594

Closed · akhdanfadh closed this 1 month ago

akhdanfadh commented 3 months ago

Closes #440


akhdanfadh commented 2 months ago

> this dataloader hasn't passed `python -m tests.test_seacrowd seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py` yet. Would you mind fixing it first until it can pass the unit test?

It is working on my end without any changes. Could you share your output?

Test result:

```
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py', schema=None, subset_id=None, data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py
INFO:__main__:self.SUBSET_ID: asr_ibsc
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.asr_ibsc.asr_ibsc
INFO:__main__:Found _SUPPORTED_TASKS=[]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SPTEXT'}
INFO:__main__:schemas_to_check: {'SPTEXT'}
INFO:__main__:Checking load_dataset with config name asr_ibsc_source
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:2508: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=' instead.
  warnings.warn(
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:923: FutureWarning: The repository for asr_ibsc contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
INFO:__main__:Checking load_dataset with config name asr_ibsc_seacrowd_sptext
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:2508: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=' instead.
  warnings.warn(
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:923: FutureWarning: The repository for asr_ibsc contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
INFO:__main__:Dataset sample [source] {'audio': {'path': 'ibf_001_001.wav', 'array': array([ 5.72814941e-01, 5.49011230e-01, -1.82495117e-02, ..., -2.31628418e-02, -1.26342773e-02, -3.05175781e-05]), 'sampling_rate': 16000}, 'transcription': 'pukul sepuluh malam'}
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:2508: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=' instead.
  warnings.warn(
/home/akhdan/miniconda3/envs/seacrowd/lib/python3.10/site-packages/datasets/load.py:923: FutureWarning: The repository for asr_ibsc contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
INFO:__main__:Dataset sample [seacrowd_sptext] {'id': '0', 'path': None, 'audio': {'path': 'ibf_001_001.wav', 'array': array([ 5.72814941e-01, 5.49011230e-01, -1.82495117e-02, ..., -2.31628418e-02, -1.26342773e-02, -3.05175781e-05]), 'sampling_rate': 16000}, 'text': 'pukul sepuluh malam', 'speaker_id': None, 'metadata': None}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 3132 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 3132
audio: 9396
text: 3132
.
----------------------------------------------------------------------
Ran 1 test in 6.545s

OK
```
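(Editor's note: the FutureWarnings above come from `datasets` itself, not from the dataloader. A minimal sketch of loading the source config directly while silencing the custom-code warning; the config name and field names are taken from the log above, and the script path assumes you run from the repo root.)

```python
from datasets import load_dataset

# Load the source schema straight from the local dataloader script.
# trust_remote_code=True pre-empts the "custom code" FutureWarning seen in
# the log; the config name is the one printed by the test harness.
dset = load_dataset(
    "seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py",
    name="asr_ibsc_source",
    trust_remote_code=True,
)
print(dset["train"][0]["transcription"])  # e.g. 'pukul sepuluh malam'
```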
akhdanfadh commented 2 months ago

Well, I don't know why those comments result in an error on your end, but not on mine. I've uncommented the line there.

@holylovenia @sabilmakbar @faridlazuarda

holylovenia commented 2 months ago

A friendly reminder for @akhdanfadh to check on this PR. 👀

akhdanfadh commented 2 months ago

> In case you missed it, it seems that this dataset has a test split aside from the train split. In case you didn't, is there a reason to exclude the test split?

@holylovenia There is only the train set in the HF datacard, which is why I excluded it. Since I think the author on HF is not the original author, I will reimplement the dataloader with the GitHub version. Let me work on it this weekend.

Probably better to remove the HF URL in our datasheet, no?
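(Editor's note: for reference, a minimal sketch, not the merged implementation, of how a GitHub-based `datasets` builder could expose both the train and test splits. The URLs, file layout, and class name below are placeholders, not the actual ASR-IBSC sources.)

```python
import os

import datasets

# Placeholder URLs: the real dataloader points at the dataset's GitHub
# release; these names are illustrative only.
_HYPOTHETICAL_URLS = {
    "train": "https://example.com/iban-asr/train.tar.gz",
    "test": "https://example.com/iban-asr/test.tar.gz",
}


class AsrIbscSketch(datasets.GeneratorBasedBuilder):
    """Illustrative builder showing how both splits could be exposed."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "audio": datasets.Audio(sampling_rate=16_000),
                    "transcription": datasets.Value("string"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        # Download/extract each archive once, then hand one directory per split
        # to _generate_examples.
        data_dirs = dl_manager.download_and_extract(_HYPOTHETICAL_URLS)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": data_dirs["train"]},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={"data_dir": data_dirs["test"]},
            ),
        ]

    def _generate_examples(self, data_dir):
        # Assumes a transcript file named "text" with lines of the form
        # "<utterance_id> <transcription>" and one <utterance_id>.wav per
        # utterance; the real corpus layout may differ.
        with open(os.path.join(data_dir, "text"), encoding="utf-8") as f:
            for idx, line in enumerate(f):
                utt_id, transcription = line.strip().split(maxsplit=1)
                yield idx, {
                    "audio": os.path.join(data_dir, f"{utt_id}.wav"),
                    "transcription": transcription,
                }
```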

akhdanfadh commented 1 month ago

Done @holylovenia @faridlazuarda. Please re-review, because this is a totally different implementation based on the GitHub data (instead of HF).
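(Editor's note: a quick hedged check for reviewers that the reimplemented loader exposes both splits, assuming the config name from the earlier test log is unchanged.)

```python
from datasets import load_dataset

# Load the SeaCrowd sptext schema from the local dataloader script and list
# the splits it provides; with the GitHub-based data, both train and test
# should appear if the reimplementation includes the extra split.
dset = load_dataset(
    "seacrowd/sea_datasets/asr_ibsc/asr_ibsc.py",
    name="asr_ibsc_seacrowd_sptext",
    trust_remote_code=True,
)
print({split: len(dset[split]) for split in dset})
```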