coqui-ai / open-speech-corpora

💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies
MIT License

💎 Open Speech Corpora

A list of open speech corpora for Speech Technology research and development.

This list prefers free (i.e. no $ cost) and truly open corpora (e.g. released under a Creative Commons license or a Community Data License Agreement). Not every corpus listed here meets those criteria, but all of them are accessible and usable for research and/or commercial use.

Feel free to propose additions to the list!

There's a long backlog of corpora to be added in the Issues, and Pull Requests are very welcome :)
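The sections of this list are grouped by license, and for the Creative Commons entries the suffix after `CC-` encodes the conditions attached to a corpus (BY: attribution, SA: share-alike, NC: non-commercial, ND: no derivatives). As a rough illustration, here is a minimal Python sketch that pulls those clause codes out of a license string as written in the LICENSE column. The names `CC_CLAUSES` and `cc_clauses` are hypothetical helpers, not part of any tool in this repository, and the result is only a hint: always read the actual license text before using a corpus.

```python
import re

# Creative Commons clause codes and a short reminder of what each one means.
CC_CLAUSES = {
    "BY": "attribution required",
    "SA": "derivatives must be shared under the same license",
    "NC": "non-commercial use only",
    "ND": "no derivative works",
}


def cc_clauses(license_name):
    """Return the CC clause codes in a string such as 'CC-BY-NC-ND 4.0'.

    Returns an empty list for CC-0 and for anything that is not written
    in the 'CC-XX-YY' style used in this list (Apache, MIT, custom, ...).
    """
    match = re.match(r"CC-([A-Z]+(?:-[A-Z]+)*)", license_name.strip())
    if match is None:
        return []
    return [code for code in match.group(1).split("-") if code in CC_CLAUSES]


if __name__ == "__main__":
    for name in ["CC-0", "CC-BY 4.0", "CC-BY-SA 3.0 US", "CC-BY-NC-ND 4.0", "Apache 2.0"]:
        codes = cc_clauses(name)
        notes = "; ".join(CC_CLAUSES[c] for c in codes) or "no CC clauses detected"
        print(f"{name}: {notes}")
```

An empty result means only that no Creative Commons clause was recognized; it says nothing about what a non-CC or custom license permits.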

📜 CC-0

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Common Voice | Multilingual | >15,000 hours (validated); >20,000 hours (total) | Multi-speaker | https://voice.mozilla.org/en/datasets | CC-0 |
| Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |
| LJ Speech Corpus | English | ~24 hours | one female | https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 | CC-0 |
| NST Danish ASR Database | Danish | 229,992 utterances | 616 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-19/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/ | CC-0 |
| NST Danish Dictation | Danish | 34,955 utterances | 151 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-20/ | CC-0 |
| NST Danish Speech Synthesis | Danish | 4,108 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-21/ | CC-0 |
| NST Swedish ASR Database | Swedish | 366,000 utterances | 1,000 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-16/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/ | CC-0 |
| NST Swedish Dictation | Swedish | 45,620 utterances | 195 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-17/ | CC-0 |
| NST Swedish Speech Synthesis | Swedish | 5,279 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-18/ | CC-0 |
| NST Norwegian ASR Database | Norwegian | 359,760 utterances | 980 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-13/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/ | CC-0 |
| NST Norwegian Dictation | Norwegian | 33,360 utterances | 144 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-14/ | CC-0 |
| NST Norwegian Speech Synthesis | Norwegian | 5,363 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-15/ | CC-0 |
| NB Tale – Speech Database for Norwegian | Norwegian | 7,600 utterances + ~12 hours | 380 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-31/ | CC-0 |
| Norwegian Parliamentary Speech Corpus (v0.1) | Norwegian | ~59 hours | 203 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/ | CC-0 |
| Wikimedia Commons Odia | Odia | ~8 hours | ~20 speakers | https://commons.wikimedia.org/wiki/Category:Odia_pronunciation | mostly(?) CC-0 |
| Thorsten-21.02-neutral | German | ~24 hours | 1 male speaker | https://www.Thorsten-Voice.de | CC-0 |
| Thorsten-21.06-emotional | German | 2,400 utterances (8 emotions) | 1 male speaker | https://www.Thorsten-Voice.de | CC-0 |

📜 CC-BY

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ARU Speech Corpus | English (UK) | 720 utterances / speaker | 12 (6 female; 6 male) | http://datacat.liverpool.ac.uk/681/1/ARU_Speech_Corpus_v1_0.zip | CC-BY 3.0 |
| Althingi Parliamentary Speech Corpus | Icelandic | 542 hours and 25 minutes | 196 speakers | http://www.malfong.is/index.php?dlid=73&lang=en | CC-BY 4.0 |
| Alþingisumræður Parliamentary Speech Corpus | Icelandic | ~21 hours | | http://www.malfong.is/index.php?dlid=8&lang=en | CC-BY 3.0 |
| Hjal Corpus | Icelandic | ~41,000 recordings | 883 speakers | http://www.malfong.is/index.php?dlid=5&lang=en | CC-BY 3.0 |
| The Malromur Corpus | Icelandic | 152 hours | 563 speakers | http://www.malfong.is/index.php?dlid=65&lang=en | CC-BY 4.0 |
| Telecooperation German Corpus for Kinect | German | ~35 hours | ~180 speakers | http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz | CC-BY 2.0 |
| African Speech Technology English-English Speech Corpus | English | ~21 hours | | https://repo.sadilar.org/handle/20.500.12185/283 | CC-BY 2.5 South Africa |
| African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours | | https://repo.sadilar.org/handle/20.500.12185/305 | CC-BY 2.5 South Africa |
| NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/280 | CC-BY 3.0 |
| NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/274 | CC-BY 3.0 |
| NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | https://repo.sadilar.org/handle/20.500.12185/272 | CC-BY 3.0 |
| NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/279 | CC-BY 3.0 |
| NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/275 | CC-BY 3.0 |
| NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/270 | CC-BY 3.0 |
| NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | https://repo.sadilar.org/handle/20.500.12185/278 | CC-BY 3.0 |
| NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/281 | CC-BY 3.0 |
| NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/271 | CC-BY 3.0 |
| NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | https://repo.sadilar.org/handle/20.500.12185/276 | CC-BY 3.0 |
| NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/277 | CC-BY 3.0 |
| Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers | https://repo.sadilar.org/handle/20.500.12185/445 | CC-BY 3.0 |
| Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | https://repo.sadilar.org/handle/20.500.12185/448 | CC-BY 3.0 |
| Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | https://repo.sadilar.org/handle/20.500.12185/442 | CC-BY 3.0 |
| LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | http://www.openslr.org/12/ | CC-BY 4.0 |
| Zeroth-Korean | Korean | 52.8 hours | 115 speakers | http://www.openslr.org/40/ | CC-BY 4.0 |
| Speech Commands | English | 17.8 hours | >1,000 speakers | https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html | CC-BY 4.0 |
| ParlamentParla | Catalan | 320 hours | | https://www.openslr.org/59/ | CC-BY 4.0 |
| SIWIS | French | ~10 hours | one female | http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip | CC-BY 4.0 |
| VCTK | English | 44 hours | 109 speakers | http://datashare.is.ed.ac.uk/download/DS_10283_3443.zip | CC-BY 4.0 |
| LibriTTS | English | 586 hours | 2,456 speakers (1,185 female / 1,271 male) | http://www.openslr.org/60/ | CC-BY 4.0 |
| Augmented LibriSpeech | Audio (English); Text (English, French) | 236 hours | | https://persyval-platform.univ-grenoble-alpes.fr/datasets/DS91 | CC-BY 4.0 |
| Helsinki Prosody Corpus | English | 262.5 hours | 1,230 speakers | https://github.com/Helsinki-NLP/prosody | CC-BY 4.0 |
| Tuva Speech Database | Norwegian | 24 hours | 40 speakers | https://www.nb.no/sprakbanken/show?serial=oai:nb.no:sbr-44&lang= | CC-BY 4.0 |
| COERLL Kʼicheʼ corpus | Kʼicheʼ | 34 minutes | ? speakers | https://cl.indiana.edu/~ftyers/resources/utexas-kiche-audio.tar.gz | CC-BY 4.0 |
| Timers and Such v0.1 | English (synthetic: US, real: various nationalities) | synthetic: 172 hours, real: 0.29 hours | 21 synthetic, 11 real | https://zenodo.org/record/4110812#.X9j0RmBOkYM | CC-BY 4.0 |
| Large Corpus of Czech Parliament Plenary Hearings | Czech | 444 hours | | https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3126 | CC-BY 4.0 |

📜 CC-BY-SA

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Iban | Iban | 8 hours | | http://www.openslr.org/24/ https://github.com/sarahjuan/iban | CC-BY-SA 2.0 |
| Vystadial 2013 | English; Czech | 41 hours; 15 hours | | http://www.openslr.org/6/ | CC-BY-SA 3.0 US |
| Vystadial 2016 Czech | Czech | 77 hours; includes Vystadial 2013 Czech | | https://lindat.cz/repository/xmlui/handle/11234/1-1740 | CC-BY-SA 4.0 |
| Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | https://github.com/Jakobovski/free-spoken-digit-dataset | CC-BY-SA 4.0 |
| Google Javanese | Javanese | 296 hours | 1019 speakers | http://www.openslr.org/35/ | CC-BY-SA 4.0 |
| Google Nepali | Nepali | 165 hours | 527 speakers | http://www.openslr.org/54/ | CC-BY-SA 4.0 |
| Google Bengali | Bengali | 229 hours | 508 speakers | http://www.openslr.org/53/ | CC-BY-SA 4.0 |
| Google Sinhala | Sinhala | 224 hours | 478 speakers | http://www.openslr.org/52/ | CC-BY-SA 4.0 |
| Google Sundanese | Sundanese | 333 hours | 542 speakers | http://www.openslr.org/36/ | CC-BY-SA 4.0 |
| Spoken Wikipedia Corpus (SWC-2017) | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | https://nats.gitlab.io/swc/ | CC-BY-SA 4.0 |
| Chuvash TTS | Chuvash | 4 hours | 1 speaker | https://github.com/ftyers/Turkic_TTS | CC-BY-SA 4.0 |
| Forschergeist | German | 2 hours | 2 speakers (1 female; 1 male) | female speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz; male speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz | CC-BY-SA 4.0 |
| Malayalam Speech Corpus by SMC | Malayalam | 1:36 hours | 75 speakers (3 female, 12 male, 60 unidentified) | https://releases.smc.org.in/msc-reviewed-speech/ | CC-BY-SA 4.0 |
| Google Malayalam | Malayalam | 3.02 hours | 24 speakers | http://www.openslr.org/63/ | CC-BY-SA 4.0 |

📜 CC-BY-ND

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| IBM Recorded Debates v1 | English | 5 hours | 10 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
| IBM Recorded Debates v2 | English | ~14 hours | 14 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |

📜 CC-BY-NC

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| TV3Parla | Catalan | 240 hours | | http://laklak.eu/share/tv3_0.3.tar.gz | CC-BY-NC 4.0 |
| Russian Open STT Corpus | Russian | ~10,000 hours public, ~10,000 more upon request | | https://github.com/snakers4/open_stt/#links | CC-BY-NC 4.0 with some exceptions |
| Russian Open TTS Corpus | Russian | 145 hours | 3 males | https://github.com/snakers4/open_tts/#links | CC-BY-NC 4.0 with some exceptions |
| OVM – Otázky Václava Moravce | Czech | 35 hours | | https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-000D-EC98-3 | CC-BY-NC 3.0 |

📜 CC-BY-NC-SA

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| CHiME-Home | English | 6.8 hours | | https://archive.org/details/chime-home | CC-BY-NC-SA 3.0 |
| Cameroon Pidgin English Corpus | Cameroon Pidgin English | ~17 hours | | http://ota.ox.ac.uk/text/2563.zip | CC-BY-NC-SA 3.0 |

📜 CC-BY-NC-ND

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers | https://voice.mozilla.org/en/datasets | CC-BY-NC 4.0 (some audio) / CC-BY-NC-ND 3.0 (most audio) / CC-BY 2.0 (all text) |
| TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | http://www.openslr.org/7/ | CC-BY-NC-ND 3.0 |
| TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | http://www.openslr.org/19/ | CC-BY-NC-ND 3.0 |
| TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | http://www.openslr.org/51/ | CC-BY-NC-ND 3.0 |
| Pansori TEDxKR | Korean | 3 hours | 41 speakers | http://www.openslr.org/58/ | CC-BY-NC-ND 4.0 |
| Primewords Mandarin | Mandarin | 100 hours | 296 speakers | http://www.openslr.org/47/ | CC-BY-NC-ND 4.0 |
| MuST-C v1.0 | Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) | 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair | | https://ict.fbk.eu/must-c-release-v1-0/ | CC-BY-NC-ND 4.0 |
| Czech Parliament Meetings | Czech | 88 hours | | https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0005-CF9C-4 | CC-BY-NC-ND 3.0 |
| BembaSpeech | Bemba | 24 hours | 17 speakers (9 male / 8 female) | https://github.com/csikasote/BembaSpeech | CC-BY-NC-ND 4.0 |

📜 CDLA-Permissive

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| DiPCo | English | ~5 hours | 32 speakers (13 female; 19 male) | https://s3.amazonaws.com/dipco/DiPCo.tgz | CDLA-Permissive-1.0 |

📜 GNU General Public License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| VoxForge | English | ~120 hours | ~2966 speakers | http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/ https://voice.mozilla.org/en/datasets | GNU-GPL 3.0 |
| VoxForge | Russian | | | http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/ http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |
| VoxForge | German | | | http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |

📜 Apache License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| AISHELL-1 | Mandarin | 170 hours | 400 speakers | http://www.openslr.org/33/ | Apache 2.0 |
| Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | http://www.openslr.org/46/ | Apache 2.0 |
| African Accented French | French | 22 hours | 232 speakers | http://www.openslr.org/57/ | Apache 2.0 |
| THCHS-30 | Mandarin Chinese | 33.57 hours (13,389 utterances) | 40 speakers (31 female; 9 male) | http://www.openslr.org/18/ | Apache 2.0 |
| Living Audio Dataset - Dutch | Dutch | 57:49 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - English | English | 50:50 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - Irish | Irish | 61:56 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - Russian | Russian | 34:58 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |

📜 MIT License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ALFFA | Amharic; Hausa (paid); Swahili; Wolof | | | http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC | MIT |

📜 BSD 3-Clause License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| M-AILABS German Corpus | German | 237 hours and 22 minutes | | http://www.caito.de/data/Training/stt_tts/de_DE.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Queen's English Corpus | Queen's English | 45 hours and 35 minutes | | http://www.caito.de/data/Training/stt_tts/en_UK.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS US English Corpus | American English | 102 hours and 7 minutes | | http://www.caito.de/data/Training/stt_tts/en_US.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Spanish Corpus | Spanish Spanish | 108 hours and 34 minutes | | http://www.caito.de/data/Training/stt_tts/es_ES.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Italian Corpus | Italian | 127 hours and 40 minutes | | http://www.caito.de/data/Training/stt_tts/it_IT.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Ukrainian Corpus | Ukrainian | 87 hours and 8 minutes | | http://www.caito.de/data/Training/stt_tts/uk_UK.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Russian Corpus | Russian | 46 hours and 47 minutes | | http://www.caito.de/data/Training/stt_tts/ru_RU.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS French-v0.9 Corpus | French | 190 hours and 30 minutes | | http://www.caito.de/data/Training/stt_tts/fr_FR.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Polish Corpus | Polish | 53 hours and 50 minutes | | http://www.caito.de/data/Training/stt_tts/pl_PL.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |

📜 Custom License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Fluent Speech Commands Corpus | English | 19 hours (30,043 utterances) | 97 speakers | http://fluent.ai:2052/jf8398hf30f0381738rucj3828chfdnchs.tar.gz | Fluent Speech Commands Public License |
| CMU Wilderness | 700 Langs | Alignments distributed without audio or text; total: ~14,000 hours; per lang: ~20 hours | | https://github.com/festvox/datasets-CMU_Wilderness | https://live.bible.is/terms |
| CHiME-5 | English | 50 hours | 48 speakers | http://spandh.dcs.shef.ac.uk/chime_challenge/data.html | CHiME-5 License |
| Fearless Steps Corpus | English | 19,000 hours (20 hours transcribed) | ~450 speakers | https://fearless-steps.github.io/ChallengePhase3/#19k_Corpus_Access | NASA Media Usage Guidelines |
| Microsoft Speech Corpus (Indian languages) | Telugu; Tamil; Gujarati | | | https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e | Microsoft Speech Corpus (Indian Languages) License |
| Microsoft Speech Language Translation Corpus | English; Chinese; Japanese | | | https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187 | Microsoft Research Data License Agreement |
| Hey Snips Corpus | English | 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances | 2215 speakers (positive & negative) and 4028 speakers (negative only) | https://research.snips.ai/datasets/keyword-spotting | Snips Data License |
| Snips SLU Corpus | English; French | 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances | English: 69 speakers; French: 30 speakers | https://research.snips.ai/datasets/spoken-language-understanding | Snips Data License |
| CMU Sphinx Group - AN4 | English | "an4_clstk" (~50 minutes); "an4test_clstk" (~6 minutes) | "an4_clstk": 21 female, 53 male; "an4test_clstk": 3 female, 7 male | http://www.speech.cs.cmu.edu/databases/an4/an4_raw.bigendian.tar.gz | AN4 |
| FT Speech | Danish | ~1,857 hours (1,017,244 utterances) | 434 speakers (176 female, 258 male) | https://ftspeech.dk | FT Speech License |
| FalaBrasil-LAPS-Constituicao | Brazilian-Portuguese | 9 hours | 1 speaker | https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge by the FalaBrasil Group. [made available free of charge] / Therefore, only the free datasets are being made available." |
| FalaBrasil-LaPSMail | Brazilian-Portuguese | 1 hour | 25 speakers | https://drive.google.com/uc?export=download&confirm=PecV&id=1B_Vq8MDSE4fBQefVxqCGSl-EcKAcjJLb | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge by the FalaBrasil Group. [made available free of charge] / Therefore, only the free datasets are being made available." |
| FalaBrasil-LaPS Benchmark | Brazilian-Portuguese | 1 hour | 1 speaker | https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge by the FalaBrasil Group. [made available free of charge] / Therefore, only the free datasets are being made available." |