SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Bloom-speech #163

Closed by SamuelCahyawijaya 10 months ago

SamuelCahyawijaya commented 11 months ago

Dataloader name: bloom_speech/bloom_speech.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?bloom_speech

Dataset: bloom_speech
Description: This version of the Bloom Library data is developed specifically for the automatic speech recognition and speech-to-text tasks. It includes data from 56 languages across 18 language families. Eight of these languages are spoken in Southeast Asia.
Subsets: bjn, bzi, ceb, ind, jra, kqr, mya, tgl
Languages: bjn, bzi, ceb, ind, jra, kqr, mya, tgl
Tasks: Speech-to-Text Translation, Text-To-Speech Synthesis
License: Other (other)
Homepage: https://huggingface.co/datasets/sil-ai/bloom-speech
HF URL: https://huggingface.co/datasets/sil-ai/bloom-speech
Paper URL: https://aclanthology.org/2022.emnlp-main.590
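
For anyone picking this up, here is a rough sketch of how the eight subsets listed above could be expanded into per-language config names. The naming pattern (`bloom_speech_<subset>_source` and `bloom_speech_<subset>_seacrowd_sptext`), the `sptext` schema label, and the helper `config_names` are assumptions made for illustration; none of them are confirmed in this issue, so check them against the datahub's contribution guidelines before relying on them.

```python
# Subset codes copied from the "Subsets" row of this issue.
SEA_SUBSETS = ["bjn", "bzi", "ceb", "ind", "jra", "kqr", "mya", "tgl"]

# Assumed values: the dataset name comes from the issue, while the
# schema label and the naming pattern below are hypothetical.
DATASET_NAME = "bloom_speech"
ASSUMED_SCHEMA = "sptext"


def config_names(subset):
    """Return the (assumed) source and seacrowd config names for one subset."""
    if subset not in SEA_SUBSETS:
        raise ValueError(f"Unknown subset: {subset!r}")
    return [
        f"{DATASET_NAME}_{subset}_source",
        f"{DATASET_NAME}_{subset}_seacrowd_{ASSUMED_SCHEMA}",
    ]


# One flat list covering every subset, in the order given in the issue.
ALL_CONFIGS = [name for subset in SEA_SUBSETS for name in config_names(subset)]
```

Under these assumptions, `config_names("ind")` gives the two config names for the Indonesian subset, and `ALL_CONFIGS` holds two entries for each of the eight languages.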

sabilmakbar commented 11 months ago

self-assign

github-actions[bot] commented 11 months ago

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

sabilmakbar commented 10 months ago

Will draft a PR later today.