SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.

Apache License 2.0

Create dataset loader for Bud500 #537

Closed by SamuelCahyawijaya 1 month ago

SamuelCahyawijaya commented 3 months ago

Dataloader name: bud500/bud500.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?bud500

Dataset	bud500
Description	Bud500 is a diverse Vietnamese speech corpus designed to support the ASR research community. With approximately 500 hours of audio, it covers a broad spectrum of topics, including podcasts, travel, books, and food, and spans accents from Vietnam's North, South, and Central regions. Derived from free public audio resources, this publicly accessible dataset is intended to significantly enhance the work of developers and researchers in the field of speech recognition.
Subsets	-
Languages	vie
Tasks	Automatic Speech Recognition
License	Apache license 2.0 (apache-2.0)
Homepage	https://huggingface.co/datasets/linhtran92/viet_bud500
HF URL	https://huggingface.co/datasets/linhtran92/viet_bud500
Paper URL	https://github.com/quocanh34/Bud500
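
As a rough illustration of how the card above maps onto the requested dataloader, here is a minimal Python sketch. `DatasetCard`, `BUD500`, and `stream_examples` are hypothetical names introduced for illustration only; they are not part of the SEACrowd codebase. The sketch assumes the Hugging Face `datasets` library is installed and that the repository exposes a `train` split, which is not confirmed in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCard:
    """Hypothetical container mirroring the fields of the card above."""

    name: str
    languages: tuple
    tasks: tuple
    license: str
    hf_repo: str

    @property
    def dataloader_path(self) -> str:
        # Follows the "<name>/<name>.py" pattern given in this issue.
        return f"{self.name}/{self.name}.py"

    @property
    def hf_url(self) -> str:
        return f"https://huggingface.co/datasets/{self.hf_repo}"


BUD500 = DatasetCard(
    name="bud500",
    languages=("vie",),
    tasks=("Automatic Speech Recognition",),
    license="apache-2.0",
    hf_repo="linhtran92/viet_bud500",
)


def stream_examples(card: DatasetCard, split: str = "train"):
    """Stream examples lazily instead of downloading ~500 hours of audio.

    Assumption: the split name "train" exists in the HF repository.
    """
    # Imported here so the card itself is usable without `datasets`.
    from datasets import load_dataset

    return load_dataset(card.hf_repo, split=split, streaming=True)
```

Calling `stream_examples(BUD500)` should return an iterable dataset, so `next(iter(...))` can be used to inspect a single example before writing the actual loader.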

bp-high commented 3 months ago

self-assign

akhdanfadh commented 3 months ago

self-assign