IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
261 stars 61 forks source link

Create dataset loader for LibriVox-Indonesia #253

Closed SamuelCahyawijaya closed 2 years ago

SamuelCahyawijaya commented 2 years ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?librivox

Dataset librivox
Description The LibriVox Indonesia dataset consists of MP3 audio and a corresponding text file we generated from the public domain audiobooks LibriVox. We collected only languages in Indonesia for this dataset. The original LibriVox audiobooks or sound files' duration varies from a few minutes to a few hours. Each audio file in the speech dataset now lasts from a few seconds to a maximum of 20 seconds.
License CC0
SamuelCahyawijaya commented 2 years ago

Duplicated with https://github.com/IndoNLP/nusa-crowd/issues/246