SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #267 | Add dataloader for struct_amb_ind #506

Closed: jensan-1 closed this pull request 5 months ago

jensan-1 commented 5 months ago

Closes #267


jensan-1 commented 5 months ago

As discussed in the issue, we will treat this dataset as local (_LOCAL = True).
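Because the dataset is local, anyone loading it has to point the loader at their own copy of the data. A minimal sketch of what that might look like with Hugging Face `datasets`: the script path and config names are taken from the test log in this thread, while `build_load_kwargs` and `load_local` are hypothetical helper names, and the sketch assumes a `datasets` version that still supports script-based loaders.

```python
# Sketch only: helper names are hypothetical, not part of seacrowd-datahub.
LOADER_SCRIPT = "seacrowd/sea_datasets/struct_amb_ind/struct_amb_ind.py"
KNOWN_SCHEMAS = ("source", "seacrowd_sptext")  # config suffixes seen in the test log


def build_load_kwargs(data_dir, schema="seacrowd_sptext"):
    """Assemble keyword arguments for datasets.load_dataset for a local dataset."""
    if schema not in KNOWN_SCHEMAS:
        raise ValueError(f"unknown schema {schema!r}; expected one of {KNOWN_SCHEMAS}")
    return {
        "path": LOADER_SCRIPT,
        "name": f"struct_amb_ind_{schema}",
        # A local dataset is not downloaded; the user supplies the data location.
        "data_dir": str(data_dir),
    }


def load_local(data_dir, schema="seacrowd_sptext"):
    """Load the dataset from a user-supplied directory (requires `datasets`)."""
    from datasets import load_dataset  # imported lazily so the helper above works without it

    return load_dataset(**build_load_kwargs(data_dir, schema))
```

For example, `load_local("seacrowd/sea_datasets/struct_amb_ind/00_dataset/ind_speech/")` would mirror the `--data_dir` argument passed to the unit test.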

Here is the output of the unit test:

(env-seacrowd) jen-santoso@DESKTOP-7FKFB32:/mnt/c/Users/JenniferSantoso/seacrowd-datahub$ python -m tests.test_seacrowd seacrowd/sea_datasets/struct_amb_ind/struct_amb_ind.py --data_dir seacrowd/sea_datasets/struct_amb_ind/00_dataset/ind_speech/
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/struct_amb_ind/struct_amb_ind.py', schema=None, subset_id=None, data_dir='seacrowd/sea_datasets/struct_amb_ind/00_dataset/ind_speech/', use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/struct_amb_ind/struct_amb_ind.py
INFO:__main__:self.SUBSET_ID: struct_amb_ind
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: seacrowd/sea_datasets/struct_amb_ind/00_dataset/ind_speech/
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.struct_amb_ind.struct_amb_ind
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.SPEECH_RECOGNITION: 'ASR'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SPTEXT'}
INFO:__main__:schemas_to_check: {'SPTEXT'}
INFO:__main__:Checking load_dataset with config name struct_amb_ind_source
/home/jen-santoso/miniconda3/envs/env-seacrowd/lib/python3.9/site-packages/datasets/load.py:2088: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
Generating train split: 4160 examples [00:00, 31294.04 examples/s]
Generating validation split: 320 examples [00:00, 20693.13 examples/s]
Generating test split: 320 examples [00:00, 21274.35 examples/s]
INFO:__main__:Checking load_dataset with config name struct_amb_ind_seacrowd_sptext
/home/jen-santoso/miniconda3/envs/env-seacrowd/lib/python3.9/site-packages/datasets/load.py:2088: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
Generating train split: 4160 examples [00:00, 23054.43 examples/s]
Generating validation split: 320 examples [00:00, 18871.74 examples/s]
Generating test split: 320 examples [00:00, 18878.91 examples/s]
INFO:__main__:Dataset sample [source]
{'id': 'ID_F01_Type06_00311', 'speaker_id': 'F01', 'path': '/home/jen-santoso/.cache/huggingface/datasets/downloads/extracted/2ca495465c805376d2d9f44a3070e60f821e82adaaae3afaf8854773f99801bc/F01/ID_F01_Type06_00311.wav', 'audio': {'path': '/home/jen-santoso/.cache/huggingface/datasets/downloads/extracted/2ca495465c805376d2d9f44a3070e60f821e82adaaae3afaf8854773f99801bc/F01/ID_F01_Type06_00311.wav', 'array': array([ 1.57810496e-06,  5.13599844e-07,  6.57081330e-07, ...,
       -1.85639988e-06, -3.42808704e-07,  9.34161903e-07]), 'sampling_rate': 16000}, 'amb_transcript': 'pedagang itu merebus ayam kemarin sore', 'disam_text': 'pedagang itu merebus ayam. pedagang itu merebusnya kemarin sore '}
/home/jen-santoso/miniconda3/envs/env-seacrowd/lib/python3.9/site-packages/datasets/load.py:2088: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
INFO:__main__:Dataset sample [seacrowd_sptext]
{'id': 'ID_F01_Type06_00311', 'path': '/home/jen-santoso/.cache/huggingface/datasets/downloads/extracted/2ca495465c805376d2d9f44a3070e60f821e82adaaae3afaf8854773f99801bc/F01/ID_F01_Type06_00311.wav', 'audio': {'path': '/home/jen-santoso/.cache/huggingface/datasets/downloads/extracted/2ca495465c805376d2d9f44a3070e60f821e82adaaae3afaf8854773f99801bc/F01/ID_F01_Type06_00311.wav', 'array': array([ 1.57810496e-06,  5.13599844e-07,  6.57081330e-07, ...,
       -1.85639988e-06, -3.42808704e-07,  9.34161903e-07]), 'sampling_rate': 16000}, 'text': 'pedagang itu merebus ayam kemarin sore', 'speaker_id': 'F01', 'metadata': {'speaker_age': None, 'speaker_gender': 'F'}}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 320 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 4160
path: 4160
audio: 12480
text: 4160
speaker_id: 4160
metadata: 8320

validation
==========
id: 320
path: 320
audio: 960
text: 320
speaker_id: 320
metadata: 640

test
==========
id: 320
path: 320
audio: 960
text: 320
speaker_id: 320
metadata: 640

.
----------------------------------------------------------------------
Ran 1 test in 51.810s

OK
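As a quick sanity check on the schema statistics above, the nested counts look consistent with one count per leaf field: `audio` has three sub-fields in the sample record (`path`, `array`, `sampling_rate`) and `metadata` has two (`speaker_age`, `speaker_gender`). That reading is an assumption drawn from the sample record, not from the test harness's code. A small sketch that checks the arithmetic:

```python
# Counts copied from the schema statistics in the test log.
stats = {
    "train": {"id": 4160, "path": 4160, "audio": 12480, "text": 4160,
              "speaker_id": 4160, "metadata": 8320},
    "validation": {"id": 320, "path": 320, "audio": 960, "text": 320,
                   "speaker_id": 320, "metadata": 640},
    "test": {"id": 320, "path": 320, "audio": 960, "text": 320,
             "speaker_id": 320, "metadata": 640},
}

# Assumed number of leaf fields per nested feature (inferred from the sample record).
LEAVES = {"audio": 3, "metadata": 2}


def counts_consistent(counts):
    """True if every field count equals n_examples * (leaf fields for that feature)."""
    n_examples = counts["id"]
    return all(value == n_examples * LEAVES.get(field, 1)
               for field, value in counts.items())


results = {split: counts_consistent(counts) for split, counts in stats.items()}
total_examples = sum(counts["id"] for counts in stats.values())

print(results)         # every split should report True
print(total_examples)  # 4160 + 320 + 320 = 4800
```

The per-split totals also match the "Generating ... split" lines in the log (4160 train, 320 validation, 320 test).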