SamuelCahyawijaya commented 7 months ago

Dataloader name: national_speech_corpus_sg_imda/national_speech_corpus_sg_imda.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?national_speech_corpus_sg_imda

Dataset	national_speech_corpus_sg_imda
Description	The National Speech Corpus (NSC) is the first large-scale Singapore English corpus, spearheaded by the Info-communications and Media Development Authority (IMDA) of Singapore. It aims to become an important source of open speech data for automatic speech recognition (ASR) research and speech-related applications. The NSC improves speech engines’ accuracy of recognition and transcription for locally accented English. The NSC can also contribute to speech synthesis technology, producing an AI voice that is more familiar to Singaporeans, with local terms pronounced more accurately. Part 1 features about 1000 hours of prompted recordings of phonetically balanced scripts from about 1000 local English speakers. Part 2 presents about 1000 hours of prompted recordings of sentences randomly generated from words based on people, food, locations, brands, etc., also from about 1000 local English speakers. Orthographic transcriptions of the recordings are available for download. Part 3 consists of about 1000 hours of conversational data recorded from about 1000 local English speakers, split into pairs. The data include conversations about daily life and recordings of speakers playing the games provided. Links to Parts 4, 5, and 6 are included in the DropBox. Please open the DropBox via the desktop application, since the folder is very large.
Subsets	read_balanced, read_pertinent, conversational_f2f, conversational_telephone
Languages	eng
Tasks	Automatic Speech Recognition, Text-To-Speech Synthesis
License	Other (other)
Homepage	https://www.imda.gov.sg/how-we-can-help/national-speech-corpus
HF URL	-
Paper URL	https://docs.google.com/forms/d/e/1FAIpQLSd3k8wFF4GQP4yo_lDAXKjCltfYk-dE-yYpegTnCB20kr7log/viewform

mrqorib commented 6 months ago

self-assign

holylovenia commented 6 months ago

Adding the bonus+3 flag due to the dataset's massive size and complicated directory structure.

SEACrowd / seacrowd-datahub

Create dataset loader for National Speech Corpus #528
