Create dataset loader for VoxLingua107

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?voxlingua

Dataset	voxlingua
Description	VoxLingua107 is a speech dataset for training spoken language identification models. The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives. VoxLingua107 contains data for 107 languages, including Indonesian, Javanese, and Sundanese.
License	CC-BY 4.0
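
As a starting point for the loader, here is a minimal sketch of how the Indonesian, Javanese, and Sundanese subsets might be enumerated once the audio has been downloaded and extracted. It assumes a layout of one folder per language code containing `.wav` segments; the folder names (`id`, `jw`, `su`), the `.wav` suffix, and the function name `iter_examples` are illustrative assumptions, not taken from the dataset card, and should be checked against the actual release before use.

```python
from pathlib import Path
from typing import Dict, Iterator, Tuple

# Assumed folder-name -> language mapping; verify the codes against the release.
NUSANTARA_LANGS: Dict[str, str] = {
    "id": "Indonesian",
    "jw": "Javanese",
    "su": "Sundanese",
}


def iter_examples(
    root: str,
    langs: Dict[str, str] = NUSANTARA_LANGS,
    suffix: str = ".wav",
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (example_id, example) pairs for the selected language folders.

    `root` is the directory that holds one sub-folder per language code
    (an assumed layout). Folders that are missing are skipped silently.
    """
    root_path = Path(root)
    for code in sorted(langs):
        lang_dir = root_path / code
        if not lang_dir.is_dir():
            continue
        # Sort file names so the example order is deterministic.
        for audio in sorted(lang_dir.glob(f"*{suffix}")):
            example_id = f"{code}/{audio.stem}"
            yield example_id, {"path": str(audio), "label": code}
```

A generator of this shape could later be wrapped in the project's loader class, for example by yielding from it inside an example-generation method. Only folders listed in `langs` are read, so the other languages in the dataset are ignored.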

IndoNLP / nusa-crowd

Create dataset loader for VoxLingua107 #328

self-assign