IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
260 stars 61 forks source link

Create dataset loader for VoxLingua107 #328

Open SamuelCahyawijaya opened 1 year ago

SamuelCahyawijaya commented 1 year ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?voxlingua

Dataset voxlingua
Description VoxLingua107 is a speech dataset for training spoken language identification models. The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according the language of the video title and description, with some post-processing steps to filter out false positives. VoxLingua107 contains data for 107 languages, including Indonesian, Javanese, and Sundanese.
License CC-BY 4.0
haryoa commented 1 year ago

self-assign