SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #537 | Add Dataloader Bud500 #565

Closed: akhdanfadh closed this 1 month ago

akhdanfadh commented 3 months ago

Closes #537

This dataset is massive, and the seacrowd test appears to require downloading all of the data before it can pass. I downloaded and loaded all the files (~100 GB) and the test passed. The result is attached here: bud500.txt

For those with a limited internet quota, I suggest loading the data in the Python REPL with streaming=True, as follows:

>>> from datasets import load_dataset
>>> 
>>> data = load_dataset("seacrowd/sea_datasets/bud500", name="bud500_source", split="train", streaming=True)
>>> list(data.take(3))
[{'audio': {'path': None, 'array': array([ 0.04827881, ..., -0.08636475]), 'sampling_rate': 16000}, 'transcription': 'thế mà hôm nay lại nghe em gái nhắc đến'},
 {'audio': {'path': None, 'array': array([-0.01934814, ...,  0.38327026]), 'sampling_rate': 16000}, 'transcription': 'các vấn đề y học chuyên khoa hoặc ứng'},
 {'audio': {'path': None, 'array': array([-0.23605347, ...,  0.0322876 ]), 'sampling_rate': 16000}, 'transcription': 'không được về nhà ăn tết thì là năm nay'}]
>>> 
>>> data = load_dataset("seacrowd/sea_datasets/bud500", name="bud500_seacrowd_sptext", split="test", streaming=True)
>>> list(data.take(3))
[{'id': '0', 'path': None, 'audio': {'path': None, 'array': array([0.02420044 , ..., 0.05227661 ]), 'sampling_rate': 16000}, 'text': 'tôi thì tôi nghĩ rằng là hầu hết tất cả', 'speaker_id': None, 'metadata': None},
 {'id': '1', 'path': None, 'audio': {'path': None, 'array': array([-0.00128174, ..., -0.14556885]), 'sampling_rate': 16000}, 'text': 'khách du lịch quốc tế và trong nước bốn', 'speaker_id': None, 'metadata': None},
 {'id': '2', 'path': None, 'audio': {'path': None, 'array': array([-0.00796509, ..., -0.06576538]), 'sampling_rate': 16000}, 'text': 'sơn đang làm ở việt nam', 'speaker_id': None, 'metadata': None}]
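Comparing the two outputs above, the seacrowd_sptext records look like the source records with `transcription` renamed to `text`, a string index as `id`, and `path`, `speaker_id`, and `metadata` left as None. The sketch below illustrates that mapping on a stand-in stream. It is not the dataloader code from this PR: `to_sptext`, `peek`, and `fake_stream` are hypothetical names, and copying the audio path into the top-level `path` is an assumption (both are None in the output above).

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def to_sptext(index: int, source: Dict[str, Any]) -> Dict[str, Any]:
    """Map one source-schema record to the seacrowd_sptext layout shown above.

    Hypothetical helper: the field names come from the REPL output, the
    mapping itself is inferred from it.
    """
    return {
        "id": str(index),
        # Assumption: reuse the audio path; it is None in the streamed output.
        "path": source["audio"].get("path"),
        "audio": source["audio"],
        "text": source["transcription"],
        "speaker_id": None,
        "metadata": None,
    }


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> Iterator[Dict[str, Any]]:
    """Lazily convert only the first n records, similar in spirit to data.take(n)."""
    for index, source in enumerate(islice(records, n)):
        yield to_sptext(index, source)


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Stand-in for a streamed split, so the sketch runs without any download."""
    for transcription in ["first utterance", "second utterance", "third utterance"]:
        yield {
            "audio": {"path": None, "array": [0.0, 0.1, -0.1], "sampling_rate": 16000},
            "transcription": transcription,
        }


if __name__ == "__main__":
    for row in peek(fake_stream(), n=2):
        print(row["id"], row["text"])
```

Because `peek` consumes the stream through `islice`, only the first `n` records are ever pulled from the generator, which is the same reason streaming=True avoids the full download when inspecting a few examples.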


holylovenia commented 2 months ago

Replacing @danjohnvelasco with @yongzx due to inactivity.

akhdanfadh commented 2 months ago

Done addressing @raileymontalan's review comments. Waiting for @yongzx's review.

yongzx commented 2 months ago

Thanks @akhdanfadh. The unit tests ran successfully on my end, and the code LGTM!

holylovenia commented 2 months ago

Hi @raileymontalan, do you have anything else to suggest? If @akhdanfadh has addressed all your concerns, I'm inclined to merge this PR.