Hi all, I found these Italian corpora: transcriptions + short clips, work done at various Italian universities.
#####################################################
Datasets with audio samples + transcriptions + other annotations. Links to the dataset dumps:
Corpus di parlato cinematografico http://www.parlaritaliano.it/index.php/it/dati/659-corpus-di-parlato-cinematografico
Corpus PraTiD http://www.parlaritaliano.it/index.php/en/corpora/645-corpus-pratid
Corpus di parlato telegiornalistico. Anni 60-2005 http://www.parlaritaliano.it/index.php/it/dati/650-corpus-di-parlato-telegiornalistico-anni-sessanta-vs-2005
SpIt-MDb (Spoken Italian - Multilevel Database) http://www.parlaritaliano.it/index.php/en/corpora/644-spit-mdb-spoken-italian-multilevel-database
ZIta-master https://github.com/ChMeluzzi/ZIta
LIM_Veneti-master https://github.com/ChMeluzzi/LIM_Veneti
########################################################
Other open datasets, with many hours. I don't see any licenses; I think they are CC0, to be confirmed.
No dataset dump is available, but they are freely accessible. A small crawler would need to be written (see the sketch after this list).
Perugia Corpus (PEC) https://www.unistrapg.it/cqpwebnew/
DB-IPIC: An XML Database for Information Patterning http://www.lablita.it/app/dbipic/
CorDIC: Corpora Didattici Italiani di Confronto http://corporadidattici.lablita.it/run.cgi/first_form?corpname=scritto;lemma=;lpos=
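A rough sketch of what such a crawler could look like, assuming the pages expose direct links to downloadable files; the index URL and file extensions below are placeholders, and the real query interfaces (e.g. CQPweb) would need site-specific handling:

```python
# Minimal crawler sketch: fetch a corpus index page and download any linked
# files. INDEX_URL and FILE_EXTS are hypothetical and must be adapted to the
# actual structure of each site.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://example.org/corpus-index"  # hypothetical index page
OUT_DIR = "downloads"
FILE_EXTS = (".wav", ".mp3", ".zip")  # assumed extensions

def crawl(index_url: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(index_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        href = urljoin(index_url, a["href"])
        if not href.lower().endswith(FILE_EXTS):
            continue
        dest = os.path.join(out_dir, os.path.basename(href.split("?")[0]))
        if os.path.exists(dest):
            continue  # already downloaded
        with requests.get(href, stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(dest, "wb") as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        print("saved", dest)

if __name__ == "__main__":
    crawl(INDEX_URL, OUT_DIR)
```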
##########################################
It's a work in progress; I am looking for other datasets.
On the parlaritaliano.it website I did not see the VoLIP corpus.
This dataset has a total of 60 hours of transcribed audio, collected in the 90s and re-released in this new corpus. The license is not specified, not even in the paper, so I can't tell whether it's CC0. http://www.lrec-conf.org/proceedings/lrec2014/pdf/906_Paper.pdf
The audio clips, however, range from a few minutes up to about 1 hour in duration. Is audio duration an issue for the DeepSpeech model? They are dialogues; is that a problem? The audio is not of excellent quality, but the speech is intelligible.
The extracted transcripts would have to be cleaned of the other annotations, but that doesn't seem difficult (see the cleanup sketch after the corpus links below).
Corpus audio+text .zip list:
http://www.parlaritaliano.it/index.php/it/visualizza-corpus?path=/Firenze
http://www.parlaritaliano.it/index.php/it/visualizza-corpus?path=/Milano
http://www.parlaritaliano.it/index.php/it/visualizza-corpus?path=/Napoli
http://www.parlaritaliano.it/index.php/it/visualizza-corpus?path=/Roma
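A sketch of the transcript cleanup mentioned above. The annotation conventions assumed here (speaker labels like "A:", markup in <> or [], "#" pause markers) are guesses and must be checked against the actual VoLIP transcription guidelines:

```python
# Rough cleanup sketch: strip speaker labels, bracketed annotations and pause
# markers, keeping only the spoken words. The regexes encode ASSUMED
# conventions, not the documented VoLIP ones.
import re

def clean_transcript(line: str) -> str:
    line = re.sub(r"^[A-Z]\d?:\s*", "", line)        # leading speaker label, e.g. "A:" or "B1:"
    line = re.sub(r"<[^>]*>|\[[^\]]*\]", " ", line)  # inline annotations in <> or []
    line = re.sub(r"[#/]+", " ", line)               # pause / overlap markers
    return re.sub(r"\s+", " ", line).strip().lower()

if __name__ == "__main__":
    sample = "A: eh <risata> ma io [pausa] non lo sapevo #"
    print(clean_transcript(sample))  # -> "eh ma io non lo sapevo"
```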
Looking at VoLIP, it is probably not CC0, since it seems to be a university resource. So if we aggregate it with the others, it shouldn't be a problem.
About the length issue, I think it is a problem: CV and M-AILABS clips, for example, are nowhere near an hour long.
@kamir86 reported that users of Google Assistant can download their recordings with transcriptions from https://support.google.com/websearch/answer/6030020?co=GENIE.Platform%3DAndroid&hl=it
There is also VoxForge!
I have downloaded some datasets to be evaluated. For possible importers that do not require audio-text segmentation (see also #107), I suggest the following links:
Evalita2009 5h (number pronunciations only) http://www.evalita.it/sites/evalita.fbk.eu/files/doc2009/evalita2009srt.zip
Short clips. Transcripts of the numbers are not literal; normalization is needed (e.g. '10' -> 'dieci'), see the sketch after this list.
MSPKA Corpus 3h http://www.mspkacorpus.it/ Short clips. Transcripts are clean.
SIWIS Database 4.5h
https://phonogenres.unige.ch/downloads/siwis_latest.zip
Short clips. Transcripts are clean.
SUGAR Corpus 1.5h https://github.com/evalitaunina/SUGAR_Corpus Short clips. Transcripts require some cleanup.
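For the Evalita2009 number normalization mentioned above, something like the following could work: replace digit sequences with their Italian spelling using the num2words package. Whether plain cardinal expansion matches how the speakers actually read the numbers is an assumption to verify against the audio:

```python
# Sketch of digit-to-word normalization for Italian transcripts,
# e.g. '10' -> 'dieci'. Requires: pip install num2words
import re
from num2words import num2words

def normalize_numbers(text: str, lang: str = "it") -> str:
    # Expand every run of digits into its cardinal form in the given language.
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)

if __name__ == "__main__":
    print(normalize_numbers("il numero 10 e poi 250"))
    # -> "il numero dieci e poi duecentocinquanta"
```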
> There is also VoxForge!
Yep, I saw it: 20 hours of short clips with clean transcripts and multiple speakers. We should look at the implementation already available in import_voxforge.py.
I'm closing this in favor of a new issue where all the datasets mentioned here are listed.
Lists of resources we can implement to add more datasets for DeepSpeech (maybe generating a custom dataset based on the Common Voice dataset organization, for which there is a sample in the readme, or generating it on the fly to avoid license issues):
Check also: https://github.com/MozillaItalia/DeepSpeech-Italian-Model/issues/34
Otherwise we can evaluate these tools to generate a dataset based on YouTube:
Another solution is to use https://github.com/srinivr/kaldi-long-audio-alignment with the Italian model to automatically split text+audio into small fragments and speed things up.
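Once the long-audio aligner has produced segment boundaries, the splitting step could look roughly like this. The "segments.csv" format (start_sec, end_sec, text) is an assumption about how we would export the aligner's output, not that tool's native format:

```python
# Sketch: cut a long recording into short clips from a CSV of aligned
# segments. Requires: pip install pydub (and ffmpeg on the system).
import csv
import os

from pydub import AudioSegment

def split_recording(audio_path: str, segments_csv: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    audio = AudioSegment.from_file(audio_path)
    with open(segments_csv, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            start_ms = int(float(row["start_sec"]) * 1000)
            end_ms = int(float(row["end_sec"]) * 1000)
            clip_path = os.path.join(out_dir, f"clip_{i:05d}.wav")
            audio[start_ms:end_ms].export(clip_path, format="wav")
            # Save the matching transcript fragment next to the clip.
            with open(clip_path + ".txt", "w", encoding="utf-8") as t:
                t.write(row["text"])

if __name__ == "__main__":
    split_recording("long_recording.wav", "segments.csv", "clips")
```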
The most important part is that the data needs to be aggregated to avoid license issues: the files need to be mixed all together so that it is not possible to recreate the original datasets.
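A minimal sketch of that aggregation step: copy clips from the individual source corpora into one shuffled pool with anonymous names and a single combined CSV. The (wav_filename, wav_filesize, transcript) columns follow the layout used by the DeepSpeech importers; the per-corpus directory structure assumed here (a transcripts.csv next to the clips) is hypothetical:

```python
# Merge clips from several corpora into one anonymized pool plus a single
# combined CSV, so the release cannot be split back into the source datasets.
import csv
import os
import random
import shutil

def aggregate(source_csvs: list[str], out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    rows = []
    for src_csv in source_csvs:
        base = os.path.dirname(src_csv)
        with open(src_csv, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                rows.append((os.path.join(base, row["wav_filename"]), row["transcript"]))
    random.shuffle(rows)  # mix corpora so ordering reveals nothing about the sources
    with open(os.path.join(out_dir, "train.csv"), "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for i, (src_wav, transcript) in enumerate(rows):
            dest = os.path.join(out_dir, f"{i:06d}.wav")
            shutil.copyfile(src_wav, dest)
            writer.writerow([dest, os.path.getsize(dest), transcript])

if __name__ == "__main__":
    aggregate(["corpus_a/transcripts.csv", "corpus_b/transcripts.csv"], "aggregated")
```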