SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for VoxLingua107 #446

Status: Closed (SamuelCahyawijaya closed this 6 months ago)

SamuelCahyawijaya commented 7 months ago

Dataloader name: voxlingua/voxlingua.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?voxlingua

Dataset: voxlingua
Description: VoxLingua107 is a comprehensive speech dataset designed for training spoken language identification models. It comprises short speech segments sourced from YouTube videos, labeled based on the language indicated in the video title and description. The dataset covers 107 languages and contains a total of 6628 hours of speech data, averaging 62 hours per language. However, the actual amount of data per language varies significantly. Additionally, there is a separate development set consisting of 1609 speech segments from 33 languages, validated by at least two volunteers to ensure the accuracy of language representation.
Subsets: -
Languages: ceb, ind, jav, zlm, mya, sun, tha, tgl, vie, war, khm, lao
Tasks: Spoken Language Identification
License: Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage: https://bark.phon.ioc.ee/voxlingua107/
HF URL: https://huggingface.co/TalTechNLP/voxlingua107-epaca-tdnn
Paper URL: https://arxiv.org/abs/2011.12998
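
For anyone picking this up, the card above could be captured roughly as follows. This is only a sketch: the names `VOXLINGUA_CARD`, `is_sea_language`, and `average_hours_per_language` are hypothetical and are not part of the seacrowd-datahub codebase. The values are copied from the card.

```python
# Hypothetical metadata sketch; not the actual seacrowd-datahub loader API.
VOXLINGUA_CARD = {
    "dataset": "voxlingua",
    # SEA language codes listed on the card (ISO 639-3).
    "languages": ["ceb", "ind", "jav", "zlm", "mya", "sun",
                  "tha", "tgl", "vie", "war", "khm", "lao"],
    "task": "Spoken Language Identification",
    "license": "cc-by-4.0",
    "homepage": "https://bark.phon.ioc.ee/voxlingua107/",
    # Figures quoted in the description, for the full 107-language dataset.
    "total_languages": 107,
    "total_hours": 6628,
}


def is_sea_language(code: str) -> bool:
    """Return True if the language code is in the SEA subset listed on the card."""
    return code.lower() in VOXLINGUA_CARD["languages"]


def average_hours_per_language(card: dict) -> float:
    """Average hours per language over the full dataset, as quoted in the description."""
    return card["total_hours"] / card["total_languages"]
```

The average works out to just under 62 hours, which matches the figure in the description. Since the per-language amounts vary a lot, the loader should not assume a uniform size for each SEA language.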

djanibekov commented 7 months ago

self-assign

sabilmakbar commented 7 months ago

self-assign

sabilmakbar commented 7 months ago

Hi @holylovenia, could you update this info? HF URL: https://huggingface.co/datasets/TalTechNLP/VoxLingua107 (though this one is incomplete as well)

holylovenia commented 7 months ago

> Hi @holylovenia, could you update this info? HF URL: https://huggingface.co/datasets/TalTechNLP/VoxLingua107 (though this one is incomplete as well)

Done. 🙏

sabilmakbar commented 6 months ago

Also, would you mind adding Khmer (khm) and Lao (lao) to the language list for this dataloader, @holylovenia? I noticed the info was incomplete.

holylovenia commented 6 months ago

> Also, would you mind adding Khmer (khm) and Lao (lao) to the language list for this dataloader, @holylovenia? I noticed the info was incomplete.

Done, @sabilmakbar. Thanks a lot for the suggestion!