Closes #537 | Add Dataloader Bud500

akhdanfadh commented 3 months ago

Closes #537

This dataset is MASSIVE (~100 GB), and it seems the seacrowd test must download ALL of the data in order to pass. I downloaded and loaded all the files and ran the test successfully. The output is attached here: bud500.txt

For those with a limited internet quota, I suggest loading the data in the Python REPL with streaming=True, as follows:

>>> from datasets import load_dataset
>>> 
>>> data = load_dataset("seacrowd/sea_datasets/bud500", name="bud500_source", split="train", streaming=True)
>>> list(data.take(3))
[{'audio': {'path': None, 'array': array([ 0.04827881, ..., -0.08636475]), 'sampling_rate': 16000}, 'transcription': 'thế mà hôm nay lại nghe em gái nhắc đến'},
 {'audio': {'path': None, 'array': array([-0.01934814, ...,  0.38327026]), 'sampling_rate': 16000}, 'transcription': 'các vấn đề y học chuyên khoa hoặc ứng'},
 {'audio': {'path': None, 'array': array([-0.23605347, ...,  0.0322876 ]), 'sampling_rate': 16000}, 'transcription': 'không được về nhà ăn tết thì là năm nay'}]
>>> 
>>> data = load_dataset("seacrowd/sea_datasets/bud500", name="bud500_seacrowd_sptext", split="test", streaming=True)
>>> list(data.take(3))
[{'id': '0', 'path': None, 'audio': {'path': None, 'array': array([0.02420044 , ..., 0.05227661 ]), 'sampling_rate': 16000}, 'text': 'tôi thì tôi nghĩ rằng là hầu hết tất cả', 'speaker_id': None, 'metadata': None},
 {'id': '1', 'path': None, 'audio': {'path': None, 'array': array([-0.00128174, ..., -0.14556885]), 'sampling_rate': 16000}, 'text': 'khách du lịch quốc tế và trong nước bốn', 'speaker_id': None, 'metadata': None},
 {'id': '2', 'path': None, 'audio': {'path': None, 'array': array([-0.00796509, ..., -0.06576538]), 'sampling_rate': 16000}, 'text': 'sơn đang làm ở việt nam', 'speaker_id': None, 'metadata': None}]
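Comparing the two outputs above, the seacrowd_sptext rows look like the source rows with 'transcription' renamed to 'text', a running string 'id' added, and 'path', 'speaker_id' and 'metadata' left as None. A minimal sketch of that mapping follows. It is inferred only from the sample output, it is not the code in this PR's dataloader, and the helper name to_seacrowd_sptext is hypothetical:

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def to_seacrowd_sptext(
    source_rows: Iterable[Dict[str, Any]],
) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Hypothetical helper: map source-schema rows to sptext-style rows.

    The field layout is inferred from the REPL output in this PR description,
    not taken from the actual dataloader script.
    """
    for idx, row in enumerate(source_rows):
        yield idx, {
            "id": str(idx),              # running index as a string
            "path": None,                # no file path in the sample output
            "audio": row["audio"],       # audio dict is passed through unchanged
            "text": row["transcription"],
            "speaker_id": None,          # no speaker info in the sample output
            "metadata": None,
        }


# Stand-in row shaped like the source schema (the audio array is shortened).
sample = [
    {
        "audio": {"path": None, "array": [0.0, 0.1], "sampling_rate": 16000},
        "transcription": "sơn đang làm ở việt nam",
    },
]
converted = [example for _, example in to_seacrowd_sptext(sample)]
print(converted[0]["id"], converted[0]["text"])
```

Because the helper is a generator, it consumes rows one at a time, so it would also work on a streamed dataset without forcing a full download.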

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

holylovenia commented 2 months ago

Replacing @danjohnvelasco with @yongzx due to inactivity.

akhdanfadh commented 2 months ago

Done addressing @raileymontalan's review comments. Waiting for @yongzx's review.

yongzx commented 2 months ago

Thanks @akhdanfadh. The unit tests ran successfully on my end and the code LGTM!

holylovenia commented 2 months ago

Hi @raileymontalan, do you have anything else to suggest? If @akhdanfadh has addressed all your concerns, I'm inclined to merge this PR.

SEACrowd / seacrowd-datahub

Closes #537 | Add Dataloader Bud500 #565
