The National Speech Corpus (NSC) is the first large-scale Singapore English corpus, spearheaded by the Info-communications and Media Development Authority (IMDA) of Singapore. It aims to become an important source of open speech data for automatic speech recognition (ASR) research and speech-related applications. The NSC is intended to improve speech engines’ recognition and transcription accuracy for locally accented English. It can also support speech synthesis, helping to produce an AI voice that sounds more familiar to Singaporeans and pronounces local terms more accurately. Part 1 features about 1000 hours of prompted recordings of phonetically balanced scripts from about 1000 local English speakers. Part 2 presents about 1000 hours of prompted recordings from about 1000 local English speakers, reading sentences randomly generated from words related to people, food, locations, brands, etc. The recordings have been transcribed orthographically, and the transcriptions are available for download. Part 3 consists of about 1000 hours of conversational data from about 1000 local English speakers, recorded in pairs. The data includes conversations about daily life and conversations held while the speakers played games provided to them. Links to Parts 4, 5, and 6 are included in the Dropbox folder. Because the folder is very large, please open it with the Dropbox desktop application.
Dataloader name: national_speech_corpus_sg_imda/national_speech_corpus_sg_imda.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?national_speech_corpus_sg_imda