💎 Open Speech Corpora

A list of open speech corpora for Speech Technology research and development.

This list favors corpora that are free (i.e. no $ cost) and truly open (e.g. released under a Creative Commons license or a Community Data License Agreement). Not every corpus listed here meets both criteria, but all of them are accessible and usable for research and/or commercial use.

Feel free to propose additions to the list!

There's a long backlog of corpora to be added in the Issues, and Pull Requests are very welcome :)

📜 CC-0

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
Common Voice	Multilingual	>15,000 hours (validated); >20,000 hours (total)	Multi-speaker	https://voice.mozilla.org/en/datasets	CC-0
Yesno	Hebrew	6 mins	one male	http://www.openslr.org/1/	CC-0
LJ Speech Corpus	English	~24 hours	one female	https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2	CC-0
NST Danish ASR Database	Danish	229,992 utterances	616 speakers	original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-19/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/	CC-0
NST Danish Dictation	Danish	34,955 utterances	151 speakers	https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-20/	CC-0
NST Danish Speech Synthesis	Danish	4,108 utterances	1 male speaker	https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-21/	CC-0
NST Swedish ASR Database	Swedish	366,000 utterances	1,000 speakers	original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-16/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/	CC-0
NST Swedish Dictation	Swedish	45,620 utterances	195 speakers	https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-17/	CC-0
NST Swedish Speech Synthesis	Swedish	5,279 utterances	1 male speaker	https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-18/	CC-0
NST Norwegian ASR Database	Norwegian	359,760 utterances	980 speakers	original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-13/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/	CC-0
NST Norwegian Dictation	Norwegian	33,360 utterances	144 speakers	https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-14/	CC-0
NST Norwegian Speech Synthesis	Norwegian	5,363 utterances	1 male speaker	https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-15/	CC-0
NB Tale – Speech Database for Norwegian	Norwegian	7,600 utterances + ~12 hours	380 speakers	https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-31/	CC-0
Norwegian Parliamentary Speech Corpus (v0.1)	Norwegian	~59 hours	203 speakers	https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/	CC-0
Wikimedia Commons Odia	Odia	~8 hours	~20 speakers	https://commons.wikimedia.org/wiki/Category:Odia_pronunciation	mostly(?) CC-0
Thorsten-21.02-neutral	German	~24 hours	1 male speaker	https://www.Thorsten-Voice.de	CC-0
Thorsten-21.06-emotional	German	2,400 utterances (8 emotions)	1 male speaker	https://www.Thorsten-Voice.de	CC-0

📜 CC-BY

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
ARU Speech Corpus	English (UK)	720 utterances / speaker	12 speakers (6 female; 6 male)	http://datacat.liverpool.ac.uk/681/1/ARU_Speech_Corpus_v1_0.zip	CC-BY 3.0
Althingi Parliamentary Speech Corpus	Icelandic	542 hours and 25 minutes	196 speakers	http://www.malfong.is/index.php?dlid=73&lang=en	CC-BY 4.0
Alþingisumræður Parliamentary Speech Corpus	Icelandic	~21 hours		http://www.malfong.is/index.php?dlid=8&lang=en	CC-BY 3.0
Hjal Corpus	Icelandic	~41,000 recordings	883 speakers	http://www.malfong.is/index.php?dlid=5&lang=en	CC-BY 3.0
The Malromur Corpus	Icelandic	152 hours	563 speakers	http://www.malfong.is/index.php?dlid=65&lang=en	CC-BY 4.0
Telecooperation German Corpus for Kinect	German	~35 hours	~180 speakers	http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz	CC-BY 2.0
African Speech Technology English-English Speech Corpus	English	~21 hours		https://repo.sadilar.org/handle/20.500.12185/283	CC-BY 2.5 South Africa
African Speech Technology isiXhosa Speech Corpus	isiXhosa	~26 hours		https://repo.sadilar.org/handle/20.500.12185/305	CC-BY 2.5 South Africa
NCHLT Afrikaans	Afrikaans	56 hours	210 speakers (98 female / 112 male)	https://repo.sadilar.org/handle/20.500.12185/280	CC-BY 3.0
NCHLT English	English	56 hours	210 speakers (100 female / 110 male)	https://repo.sadilar.org/handle/20.500.12185/274	CC-BY 3.0
NCHLT isiNdebele	isiNdebele	56 hours	148 speakers (78 female / 70 male)	https://repo.sadilar.org/handle/20.500.12185/272	CC-BY 3.0
NCHLT isiXhosa	isiXhosa	56 hours	209 speakers (106 female / 103 male)	https://repo.sadilar.org/handle/20.500.12185/279	CC-BY 3.0
NCHLT isiZulu	isiZulu	56 hours	210 speakers (98 female / 112 male)	https://repo.sadilar.org/handle/20.500.12185/275	CC-BY 3.0
NCHLT Sepedi	Sepedi	56 hours	210 speakers (100 female / 110 male)	https://repo.sadilar.org/handle/20.500.12185/270	CC-BY 3.0
NCHLT Sesotho	Sesotho	56 hours	210 speakers (113 female / 97 male)	https://repo.sadilar.org/handle/20.500.12185/278	CC-BY 3.0
NCHLT Setswana	Setswana	56 hours	210 speakers (109 female / 101 male)	https://repo.sadilar.org/handle/20.500.12185/281	CC-BY 3.0
NCHLT Siswati	Siswati	56 hours	197 speakers (96 female / 101 male)	https://repo.sadilar.org/handle/20.500.12185/271	CC-BY 3.0
NCHLT Tshivenda	Tshivenda	56 hours	208 speakers (83 female / 125 male)	https://repo.sadilar.org/handle/20.500.12185/276	CC-BY 3.0
NCHLT Xitsonga	Xitsonga	56 hours	198 speakers (95 female / 103 male)	https://repo.sadilar.org/handle/20.500.12185/277	CC-BY 3.0
Lwazi II Cross-lingual Proper Name Corpus	Afrikaans; English; isiZulu; Sesotho	2 hours 5 mins	20 speakers	https://repo.sadilar.org/handle/20.500.12185/445	CC-BY 3.0
Lwazi II Proper Name Call Routing Telephone Corpus	English	2 hours 7 mins		https://repo.sadilar.org/handle/20.500.12185/448	CC-BY 3.0
Lwazi II Afrikaans Trajectory Tracking Corpus	Afrikaans	4 hours	one male	https://repo.sadilar.org/handle/20.500.12185/442	CC-BY 3.0
LibriSpeech	English	~1000 hours	2484 speakers (1201 female / 1283 male)	http://www.openslr.org/12/	CC-BY 4.0
Zeroth-Korean	Korean	52.8 hours	115 speakers	http://www.openslr.org/40/	CC-BY 4.0
Speech Commands	English	17.8 hours	>1,000 speakers	https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html	CC-BY 4.0
ParlamentParla	Catalan	320 hours		https://www.openslr.org/59/	CC-BY 4.0
SIWIS	French	~10 hours	one female	http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip	CC-BY 4.0
VCTK	English	44 hours	109 speakers	http://datashare.is.ed.ac.uk/download/DS_10283_3443.zip	CC-BY 4.0
LibriTTS	English	586 hours	2,456 speakers (1,185 female / 1,271 male)	http://www.openslr.org/60/	CC-BY 4.0
Augmented LibriSpeech	Audio (English); Text (English, French)	236 hours		https://persyval-platform.univ-grenoble-alpes.fr/datasets/DS91	CC-BY 4.0
Helsinki Prosody Corpus	English	262.5 hours	1,230 speakers	https://github.com/Helsinki-NLP/prosody	CC-BY 4.0
Tuva Speech Database	Norwegian	24 hours	40 speakers	https://www.nb.no/sprakbanken/show?serial=oai:nb.no:sbr-44&lang=	CC-BY 4.0
COERLL Kʼicheʼ corpus	Kʼicheʼ	34 minutes	? speakers	https://cl.indiana.edu/~ftyers/resources/utexas-kiche-audio.tar.gz	CC-BY 4.0
Timers and Such v0.1	English (synthetic: US, real: various nationalities)	synthetic: 172 hours, real: 0.29 hours	21 synthetic, 11 real	https://zenodo.org/record/4110812#.X9j0RmBOkYM	CC-BY 4.0
Large Corpus of Czech Parliament Plenary Hearings	Czech	444 hours		https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3126	CC-BY 4.0

📜 CC-BY-SA

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
Iban	Iban	8 hours		http://www.openslr.org/24/ https://github.com/sarahjuan/iban	CC-BY-SA 2.0
Vystadial 2013	English; Czech	41 hours; 15 hours		http://www.openslr.org/6/	CC-BY-SA 3.0 US
Vystadial 2016 Czech	Czech	77 hours; includes Vystadial 2013 Czech		https://lindat.cz/repository/xmlui/handle/11234/1-1740	CC-BY-SA 4.0
Free Spoken Digit Dataset	English	2,000 isolated digits	4 speakers	https://github.com/Jakobovski/free-spoken-digit-dataset	CC-BY-SA 4.0
Google Javanese	Javanese	296 hours	1019 speakers	http://www.openslr.org/35/	CC-BY-SA 4.0
Google Nepali	Nepali	165 hours	527 speakers	http://www.openslr.org/54/	CC-BY-SA 4.0
Google Bengali	Bengali	229 hours	508 speakers	http://www.openslr.org/53/	CC-BY-SA 4.0
Google Sinhala	Sinhala	224 hours	478 speakers	http://www.openslr.org/52/	CC-BY-SA 4.0
Google Sundanese	Sundanese	333 hours	542 speakers	http://www.openslr.org/36/	CC-BY-SA 4.0
Spoken Wikipedia Corpus (SWC-2017)	English; German; Dutch	182 hours; 249 hours; 79 hours	395 speakers; 339 speakers; 145 speakers	https://nats.gitlab.io/swc/	CC-BY-SA 4.0
Chuvash TTS	Chuvash	4 hours	1 speaker	https://github.com/ftyers/Turkic_TTS	CC-BY-SA 4.0
Forschergeist	German	2 hours	2 speakers (1 female; 1 male)	female speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz; male speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz	CC-BY-SA 4.0
Malayalam Speech Corpus by SMC	Malayalam	1:36 hours	75 speakers (3 female, 12 male, 60 unidentified)	https://releases.smc.org.in/msc-reviewed-speech/	CC-BY-SA 4.0
Google Malayalam	Malayalam	3.02 hours	24 speakers	http://www.openslr.org/63/	CC-BY-SA 4.0

📜 CC-BY-ND

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
IBM Recorded Debates v1	English	5 hours	10 speakers	https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis	CC-BY-ND
IBM Recorded Debates v2	English	~14 hours	14 speakers	https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis	CC-BY-ND

📜 CC-BY-NC

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
TV3Parla	Catalan	240 hours		http://laklak.eu/share/tv3_0.3.tar.gz	CC-BY-NC 4.0
Russian Open STT Corpus	Russian	~10,000 hours public, ~10,000 more upon request		https://github.com/snakers4/open_stt/#links	CC-BY-NC 4.0 with some exceptions
Russian Open TTS Corpus	Russian	145 hours	3 males	https://github.com/snakers4/open_tts/#links	CC-BY-NC 4.0 with some exceptions
OVM – Otázky Václava Moravce	Czech	35 hours		https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-000D-EC98-3	CC-BY-NC 3.0

📜 CC-BY-NC-SA

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
CHiME-Home	English	6.8 hours		https://archive.org/details/chime-home	CC-BY-NC-SA 3.0
Cameroon Pidgin English Corpus	Cameroon Pidgin English	~17 hours		http://ota.ox.ac.uk/text/2563.zip	CC-BY-NC-SA 3.0

📜 CC-BY-NC-ND

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
Tatoeba-Eng	English	~250 hours (rough estimate)	6 speakers	https://voice.mozilla.org/en/datasets	CC-BY-NC 4.0 (some audio) / CC-BY-NC-ND 3.0 (most audio) / CC-BY 2.0 (all text)
TED-LIUM	English	118 hours	685 speakers (36h female / 81h male)	http://www.openslr.org/7/	CC-BY-NC-ND 3.0
TED-LIUM-2	English	207 hours	1242 speakers (66h female / 141h male)	http://www.openslr.org/19/	CC-BY-NC-ND 3.0
TED-LIUM-3	English	452 hours	2028 speakers (134h female / 316h male)	http://www.openslr.org/51/	CC-BY-NC-ND 3.0
Pansori TEDxKR	Korean	3 hours	41 speakers	http://www.openslr.org/58/	CC-BY-NC-ND 4.0
Primewords Mandarin	Mandarin	100 hours	296 speakers	http://www.openslr.org/47/	CC-BY-NC-ND 4.0
MuST-C v1.0	Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish)	408, 504, 492, 465, 442, 385, 432, 489 hours per language pair		https://ict.fbk.eu/must-c-release-v1-0/	CC-BY-NC-ND 4.0
Czech Parliament Meetings	Czech	88 hours		https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0005-CF9C-4	CC-BY-NC-ND 3.0
BembaSpeech	Bemba	24 hours	17 speakers (9 male / 8 female)	https://github.com/csikasote/BembaSpeech	CC-BY-NC-ND 4.0

📜 CDLA-Permissive

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
DiPCo	English	~5 hours	32 speakers (13 female; 19 male)	https://s3.amazonaws.com/dipco/DiPCo.tgz	CDLA-Permissive-1.0

📜 GNU General Public License

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
VoxForge	English	~120 hours	~2966 speakers	http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/ https://voice.mozilla.org/en/datasets	GNU-GPL 3.0
VoxForge	Russian			http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/ http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/	GNU-GPL 3.0
VoxForge	German			http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/	GNU-GPL 3.0

📜 Apache License

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
AISHELL-1	Mandarin	170 hours	400 speakers	http://www.openslr.org/33/	Apache 2.0
Tunisian_MSA	Modern Standard Arabic (Tunisia)	11.2 hours	118 speakers	http://www.openslr.org/46/	Apache 2.0
African Accented French	French	22 hours	232 speakers	http://www.openslr.org/57/	Apache 2.0
THCHS-30	Mandarin Chinese	33.57 hours (13,389 utterances)	40 speakers (31 female; 9 male)	http://www.openslr.org/18/	Apache 2.0
Living Audio Dataset - Dutch	Dutch	57:49 min	1 speaker	https://github.com/Idlak/Living-Audio-Dataset	Apache 2.0
Living Audio Dataset - English	English	50:50 min	1 speaker	https://github.com/Idlak/Living-Audio-Dataset	Apache 2.0
Living Audio Dataset - Irish	Irish	61:56 min	1 speaker	https://github.com/Idlak/Living-Audio-Dataset	Apache 2.0
Living Audio Dataset - Russian	Russian	34:58 min	1 speaker	https://github.com/Idlak/Living-Audio-Dataset	Apache 2.0

📜 MIT License

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
ALFFA	Amharic; Hausa (paid); Swahili; Wolof			http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC	MIT

📜 BSD 3-Clause License

CORPUS	LANGUAGES	# HOURS	DOWNLOAD	LICENSE
M-AILABS German Corpus	German	237 hours and 22 minutes	http://www.caito.de/data/Training/stt_tts/de_DE.tgz	M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Queen's English Corpus	Queen's English	45 hours and 35 minutes	http://www.caito.de/data/Training/stt_tts/en_UK.tgz	M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS US English Corpus	American English	102 hours and 7 minutes	http://www.caito.de/data/Training/stt_tts/en_US.tgz	M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Spanish Corpus	Spanish (Spain)	108 hours and 34 minutes	http://www.caito.de/data/Training/stt_tts/es_ES.tgz	M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Italian Corpus	Italian	127 hours and 40 minutes	http://www.caito.de/data/Training/stt_tts/it_IT.tgz	M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Ukrainian Corpus	Ukrainian	87 hours and 8 minutes	http://www.caito.de/data/Training/stt_tts/uk_UK.tgz	M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Russian Corpus	Russian	46 hours and 47 minutes	http://www.caito.de/data/Training/stt_tts/ru_RU.tgz	M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS French-v0.9 Corpus	French	190 hours and 30 minutes	http://www.caito.de/data/Training/stt_tts/fr_FR.tgz	M-AILABS LICENSE (a data-specific BSD 3-Clause License)
M-AILABS Polish Corpus	Polish	53 hours and 50 minutes	http://www.caito.de/data/Training/stt_tts/pl_PL.tgz	M-AILABS LICENSE (a data-specific BSD 3-Clause License)

📜 Custom License

CORPUS	LANGUAGES	# HOURS	# SPEAKERS	DOWNLOAD	LICENSE
Fluent Speech Commands Corpus	English	19 hours (30,043 utterances)	97 speakers	http://fluent.ai:2052/jf8398hf30f0381738rucj3828chfdnchs.tar.gz	Fluent Speech Commands Public License
CMU Wilderness	700 languages	Alignments distributed without audio or text; total: ~14,000 hours; per language: ~20 hours		https://github.com/festvox/datasets-CMU_Wilderness	https://live.bible.is/terms
CHiME-5	English	50 hours	48 speakers	http://spandh.dcs.shef.ac.uk/chime_challenge/data.html	CHiME-5 License
Fearless Steps Corpus	English	19,000 hours (20 hours transcribed)	~450 speakers	https://fearless-steps.github.io/ChallengePhase3/#19k_Corpus_Access	NASA Media Usage Guidelines
Microsoft Speech Corpus (Indian languages)	Telugu; Tamil; Gujarati			https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e	Microsoft Speech Corpus (Indian Languages) License
Microsoft Speech Language Translation Corpus	English; Chinese; Japanese			https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187	Microsoft Research Data License Agreement
Hey Snips Corpus	English	11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances	2215 speakers (positive & negative) and 4028 speakers (negative only)	https://research.snips.ai/datasets/keyword-spotting	Snips Data License
Snips SLU Corpus	English; French	1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances	English: 69 speakers; French: 30 speakers	https://research.snips.ai/datasets/spoken-language-understanding	Snips Data License
CMU Sphinx Group - AN4	English	"an4_clstk"(~50 minutes) "an4test_clstk" (~6 minutes)	"an4_clstk": 21 female, 53 male "an4test_clstk": 3 female, 7 male	http://www.speech.cs.cmu.edu/databases/an4/an4_raw.bigendian.tar.gz	AN4
FT Speech	Danish	~1,857 hours (1,017,244 utterances)	434 speakers (176 female, 258 male)	https://ftspeech.dk	FT Speech License
FalaBrasil-LAPS-Constituicao	Brazilian-Portuguese	9 hours	1 speaker	https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT	"Bases de áudio transcrito e bases de texto normalizadas (sem pontuação, com números escritos por extenso, etc.) disponibilizadas de forma gratuita pelo Grupo FalaBrasil. [disponibilizadas de forma gratuita] / Portanto, apenas as bases livres estão sendo disponibilizadas."
FalaBrasil-LaPSMail	Brazilian-Portuguese	1 hour	25 speakers	https://drive.google.com/uc?export=download&confirm=PecV&id=1B_Vq8MDSE4fBQefVxqCGSl-EcKAcjJLb	"Bases de áudio transcrito e bases de texto normalizadas (sem pontuação, com números escritos por extenso, etc.) disponibilizadas de forma gratuita pelo Grupo FalaBrasil. [disponibilizadas de forma gratuita] / Portanto, apenas as bases livres estão sendo disponibilizadas."
FalaBrasil-LaPS Benchmark	Brazilian-Portuguese	1 hour	1 speaker	https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo	"Bases de áudio transcrito e bases de texto normalizadas (sem pontuação, com números escritos por extenso, etc.) disponibilizadas de forma gratuita pelo Grupo FalaBrasil. [disponibilizadas de forma gratuita] / Portanto, apenas as bases livres estão sendo disponibilizadas."
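
Since the tables above are grouped by license, it can be handy to filter them programmatically, for example to set aside corpora whose Creative Commons license carries a NonCommercial (NC) or NoDerivatives (ND) element. The following is only a minimal sketch, assuming a table has been saved as tab-separated text with the header row shown above. The function names `parse_rows` and `cc_restrictions` are hypothetical helpers, not part of this repository. The check only reads the license label as written in the LICENSE column and says nothing about custom or non-CC licenses, so always read the actual license text before using a corpus.

```python
import csv
import io

# A few rows copied from the tables above, joined with tabs (assumed export format).
SAMPLE = "\n".join([
    "CORPUS\tLANGUAGES\t# HOURS\t# SPEAKERS\tDOWNLOAD\tLICENSE",
    "Yesno\tHebrew\t6 mins\tone male\thttp://www.openslr.org/1/\tCC-0",
    "TV3Parla\tCatalan\t240 hours\t\thttp://laklak.eu/share/tv3_0.3.tar.gz\tCC-BY-NC 4.0",
    "Pansori TEDxKR\tKorean\t3 hours\t41 speakers\thttp://www.openslr.org/58/\tCC-BY-NC-ND 4.0",
])


def parse_rows(tsv_text):
    """Parse tab-separated text whose first line is the header row."""
    # QUOTE_NONE keeps literal double quotes (used in some cells) as plain text.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def cc_restrictions(license_label):
    """Return the CC restriction elements (NC, ND, SA) named in a license label.

    Labels that do not mention "CC" yield an empty set, meaning "unknown",
    not "unrestricted".
    """
    tokens = license_label.upper().replace("-", " ").split()
    if "CC" not in tokens:
        return set()
    return {t for t in tokens if t in {"NC", "ND", "SA"}}


rows = parse_rows(SAMPLE)
noncommercial = [r["CORPUS"] for r in rows if "NC" in cc_restrictions(r["LICENSE"])]
print(noncommercial)
```

Matching on whole tokens rather than substrings avoids false hits in labels such as "CC-BY 2.5 South Africa".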
