SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for National Speech Corpus #528

SamuelCahyawijaya closed this issue 4 months ago

SamuelCahyawijaya commented 7 months ago

Dataloader name: national_speech_corpus_sg_imda/national_speech_corpus_sg_imda.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?national_speech_corpus_sg_imda

Dataset: national_speech_corpus_sg_imda
Description: The National Speech Corpus (NSC) is the first large-scale Singapore English corpus, spearheaded by the Info-communications and Media Development Authority (IMDA) of Singapore. It aims to become an important source of open speech data for automatic speech recognition (ASR) research and speech-related applications. The NSC improves speech engines’ recognition and transcription accuracy for locally accented English. It can also contribute to speech synthesis technology, producing an AI voice that is more familiar to Singaporeans, with local terms pronounced more accurately. Part 1 features about 1000 hours of prompted recordings of phonetically balanced scripts from about 1000 local English speakers. Part 2 presents about 1000 hours of prompted recordings of sentences randomly generated from words based on people, food, location, brands, etc., also from about 1000 local English speakers. The recordings have been transcribed orthographically, and the transcriptions are available for download. Part 3 consists of about 1000 hours of conversational data recorded from about 1000 local English speakers, split into pairs. The data includes conversations about daily life and conversations between speakers playing the provided games. Links to Parts 4, 5, and 6 are included in the DropBox. Please open the DropBox via the desktop application, since the folder is very large.
Subsets: read_balanced, read_pertinent, conversational_f2f, conversational_telephone
Languages: eng
Tasks: Automatic Speech Recognition, Text-To-Speech Synthesis
License: Other (other)
Homepage: https://www.imda.gov.sg/how-we-can-help/national-speech-corpus
HF URL: -
Paper URL: https://docs.google.com/forms/d/e/1FAIpQLSd3k8wFF4GQP4yo_lDAXKjCltfYk-dE-yYpegTnCB20kr7log/viewform

mrqorib commented 6 months ago

self-assign

holylovenia commented 6 months ago

Adding the bonus+3 flag due to the dataset's massive size and complicated directory structure.