SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Bud500 #537

Closed: SamuelCahyawijaya closed this issue 1 month ago

SamuelCahyawijaya commented 3 months ago

Dataloader name: bud500/bud500.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?bud500

Dataset: bud500
Description: Bud500 is a diverse Vietnamese speech corpus designed to support the ASR research community. With approximately 500 hours of audio, it covers a broad spectrum of topics, including podcasts, travel, books, and food, and spans accents from Vietnam's North, South, and Central regions. Derived from free public audio resources, this publicly accessible dataset is intended to support the work of developers and researchers in speech recognition.
Subsets: -
Languages: vie
Tasks: Automatic Speech Recognition
License: Apache license 2.0 (apache-2.0)
Homepage: https://huggingface.co/datasets/linhtran92/viet_bud500
HF URL: https://huggingface.co/datasets/linhtran92/viet_bud500
Paper URL: https://github.com/quocanh34/Bud500
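
For anyone picking this up, here is a rough sketch of the example-mapping step such a loader might need. It is not the merged implementation. It assumes each raw row has an `audio` dict and a `transcription` string, and that the target is a speech-text schema with `id`, `path`, `audio`, `text`, `speaker_id`, and `metadata` fields; both the column names and the helper `to_sptext_examples` are assumptions for illustration and should be checked against the HF dataset card and the SEACrowd schema definitions.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def to_sptext_examples(
    rows: Iterable[Dict[str, Any]], split: str
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Map raw rows to (key, example) pairs in an assumed speech-text layout.

    Hypothetical helper: the input column names ("audio", "transcription")
    and the output field names are assumptions, not confirmed by this issue.
    """
    for idx, row in enumerate(rows):
        example_id = f"{split}_{idx}"
        audio = row["audio"]
        yield example_id, {
            "id": example_id,
            "path": audio.get("path"),
            "audio": audio,
            "text": row["transcription"],
            # The description above mentions no speaker information,
            # so these fields are left empty in this sketch.
            "speaker_id": None,
            "metadata": {"speaker_age": None, "speaker_gender": None},
        }


# Tiny in-memory row standing in for a real record.
sample_rows = [
    {
        "audio": {"path": "a.wav", "array": [0.0, 0.1], "sampling_rate": 16000},
        "transcription": "xin chào",
    }
]
examples = list(to_sptext_examples(sample_rows, "train"))
```

In a real loader the rows would come from the HF URL above (for instance via `datasets.load_dataset("linhtran92/viet_bud500", split="train", streaming=True)`) rather than from an in-memory list.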

bp-high commented 3 months ago

self-assign

akhdanfadh commented 3 months ago

self-assign