Hi all, I found these Italian corpora: transcriptions + short clips, work done at various Italian universities.
#####################################################
Datasets with audio samples + transcriptions + other annotations. Links to the dataset dumps:
Corpus di parlato cinematografico http://www.parlaritaliano.it/index.php/it/dati/659-corpus-di-parlato-cinematografico
Corpus PraTiD http://www.parlaritaliano.it/index.php/en/corpora/645-corpus-pratid
Corpus di parlato telegiornalistico. Anni 60-2005 http://www.parlaritaliano.it/index.php/it/dati/650-corpus-di-parlato-telegiornalistico-anni-sessanta-vs-2005
SpIt-MDb (Spoken Italian - Multilevel Database) http://www.parlaritaliano.it/index.php/en/corpora/644-spit-mdb-spoken-italian-multilevel-database
ZIta-master https://github.com/ChMeluzzi/ZIta
LIM_Veneti-master https://github.com/ChMeluzzi/LIM_Veneti
########################################################
Other open datasets, with many hours. I don't see any licenses; I think they are CC0, to be confirmed.
No dataset dump is available, but they are freely accessible. A small crawler would need to be written (see the sketch after this list).
Perugia Corpus (PEC) https://www.unistrapg.it/cqpwebnew/
DB-IPIC: An XML Database for Information Patterning http://www.lablita.it/app/dbipic/
CorDIC: Corpora Didattici Italiani di Confronto http://corporadidattici.lablita.it/run.cgi/first_form?corpname=scritto;lemma=;lpos=
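A rough sketch of what such a crawler could look like, assuming the pages expose direct links to downloadable files; the index URL and file extensions below are placeholders, and the real query interfaces (e.g. CQPweb) would need site-specific handling:

```python
# Minimal crawler sketch: fetch a corpus index page and download any linked
# files. INDEX_URL and FILE_EXTS are hypothetical and must be adapted to the
# actual structure of each site.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://example.org/corpus-index"  # hypothetical index page
OUT_DIR = "downloads"
FILE_EXTS = (".wav", ".mp3", ".zip")  # assumed extensions

def crawl(index_url: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(index_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        href = urljoin(index_url, a["href"])
        if not href.lower().endswith(FILE_EXTS):
            continue
        dest = os.path.join(out_dir, os.path.basename(href.split("?")[0]))
        if os.path.exists(dest):
            continue  # already downloaded
        with requests.get(href, stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(dest, "wb") as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        print("saved", dest)

if __name__ == "__main__":
    crawl(INDEX_URL, OUT_DIR)
```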
##########################################
It's a work in progress; I am looking for other datasets.
On the parlaritaliano.it website I did not see the VoLIP corpus.
This dataset has a total of 60 hours of transcribed audio, collected in the 90s and re-released in this new corpus. The license is not specified, not even in the paper, so I can't tell whether it's CC0. http://www.lrec-conf.org/proceedings/lrec2014/pdf/906_Paper.pdf
The audio clips, however, range from a few minutes up to about 1 hour in duration. Is audio duration an issue for the DeepSpeech model? They are dialogues; is that a problem? The audio is not of excellent quality, but the speech is intelligible.
The extracted transcripts would have to be cleaned of the other annotations, but that doesn't seem difficult (see the cleanup sketch after the corpus links below).
Corpus audio+text .zip list:
http://www.parlaritaliano.it/index.php/it/visualizza-corpus?path=/Firenze
http://www.parlaritaliano.it/index.php/it/visualizza-corpus?path=/Milano
http://www.parlaritaliano.it/index.php/it/visualizza-corpus?path=/Napoli
http://www.parlaritaliano.it/index.php/it/visualizza-corpus?path=/Roma
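A sketch of the transcript cleanup mentioned above. The annotation conventions assumed here (speaker labels like "A:", markup in <> or [], "#" pause markers) are guesses and must be checked against the actual VoLIP transcription guidelines:

```python
# Rough cleanup sketch: strip speaker labels, bracketed annotations and pause
# markers, keeping only the spoken words. The regexes encode ASSUMED
# conventions, not the documented VoLIP ones.
import re

def clean_transcript(line: str) -> str:
    line = re.sub(r"^[A-Z]\d?:\s*", "", line)        # leading speaker label, e.g. "A:" or "B1:"
    line = re.sub(r"<[^>]*>|\[[^\]]*\]", " ", line)  # inline annotations in <> or []
    line = re.sub(r"[#/]+", " ", line)               # pause / overlap markers
    return re.sub(r"\s+", " ", line).strip().lower()

if __name__ == "__main__":
    sample = "A: eh <risata> ma io [pausa] non lo sapevo #"
    print(clean_transcript(sample))  # -> "eh ma io non lo sapevo"
```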
Looking at VoLIP, it is probably not CC0, since it seems to be a university resource. So if we aggregate it with the others, it shouldn't be a problem.
About the length issue, I think it is a problem: CV and M-AILABS clips, for example, are nowhere near an hour long.
@kamir86 reported that users of Google Assistant can download their recordings with transcriptions from https://support.google.com/websearch/answer/6030020?co=GENIE.Platform%3DAndroid&hl=it
There is also VoxForge!
I have downloaded some datasets to be evaluated. For possible importers that do not require audio-text segmentation (see also #107), I suggest the following links:
Evalita2009 5h (number pronunciations only) http://www.evalita.it/sites/evalita.fbk.eu/files/doc2009/evalita2009srt.zip
Short clips. Transcripts of the numbers are not literal; normalization is needed (e.g. '10' -> 'dieci'), see the sketch after this list.
MSPKA Corpus 3h http://www.mspkacorpus.it/ Short clips. Transcripts are clean.
SIWIS Database 4.5h
https://phonogenres.unige.ch/downloads/siwis_latest.zip
Short clips. Transcripts are clean.
SUGAR Corpus 1.5h https://github.com/evalitaunina/SUGAR_Corpus Short clips. Transcripts require some cleanup.
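For the Evalita2009 number normalization mentioned above, something like the following could work: replace digit sequences with their Italian spelling using the num2words package. Whether plain cardinal expansion matches how the speakers actually read the numbers is an assumption to verify against the audio:

```python
# Sketch of digit-to-word normalization for Italian transcripts,
# e.g. '10' -> 'dieci'. Requires: pip install num2words
import re
from num2words import num2words

def normalize_numbers(text: str, lang: str = "it") -> str:
    # Expand every run of digits into its cardinal form in the given language.
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)

if __name__ == "__main__":
    print(normalize_numbers("il numero 10 e poi 250"))
    # -> "il numero dieci e poi duecentocinquanta"
```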
> There is also VoxForge!
Yep, I saw it: 20 hours of short clips with clean transcripts and multiple speakers. We should look at the implementation already available in import_voxforge.py.
I'm closing this in favor of a new issue where all the datasets mentioned here are listed.
Lists of resources we can implement to add more datasets for DeepSpeech (maybe generating a custom dataset based on the Common Voice dataset organization, for which there is a sample in the readme, or generating it on the fly to avoid license issues):
Check also: https://github.com/MozillaItalia/DeepSpeech-Italian-Model/issues/34
Otherwise we can evaluate these tools to generate a dataset based on YouTube:
Another solution is to use https://github.com/srinivr/kaldi-long-audio-alignment with the Italian model to automatically split text+audio into small fragments and speed things up.
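Once the long-audio aligner has produced segment boundaries, the splitting step could look roughly like this. The "segments.csv" format (start_sec, end_sec, text) is an assumption about how we would export the aligner's output, not that tool's native format:

```python
# Sketch: cut a long recording into short clips from a CSV of aligned
# segments. Requires: pip install pydub (and ffmpeg on the system).
import csv
import os

from pydub import AudioSegment

def split_recording(audio_path: str, segments_csv: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    audio = AudioSegment.from_file(audio_path)
    with open(segments_csv, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            start_ms = int(float(row["start_sec"]) * 1000)
            end_ms = int(float(row["end_sec"]) * 1000)
            clip_path = os.path.join(out_dir, f"clip_{i:05d}.wav")
            audio[start_ms:end_ms].export(clip_path, format="wav")
            # Save the matching transcript fragment next to the clip.
            with open(clip_path + ".txt", "w", encoding="utf-8") as t:
                t.write(row["text"])

if __name__ == "__main__":
    split_recording("long_recording.wav", "segments.csv", "clips")
```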
The most important part is that the data needs to be aggregated to avoid license issues: the files need to be mixed all together so that it is not possible to recreate the original datasets.
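A minimal sketch of that aggregation step: copy clips from the individual source corpora into one shuffled pool with anonymous names and a single combined CSV. The (wav_filename, wav_filesize, transcript) columns follow the layout used by the DeepSpeech importers; the per-corpus directory structure assumed here (a transcripts.csv next to the clips) is hypothetical:

```python
# Merge clips from several corpora into one anonymized pool plus a single
# combined CSV, so the release cannot be split back into the source datasets.
import csv
import os
import random
import shutil

def aggregate(source_csvs: list[str], out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    rows = []
    for src_csv in source_csvs:
        base = os.path.dirname(src_csv)
        with open(src_csv, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                rows.append((os.path.join(base, row["wav_filename"]), row["transcript"]))
    random.shuffle(rows)  # mix corpora so ordering reveals nothing about the sources
    with open(os.path.join(out_dir, "train.csv"), "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for i, (src_wav, transcript) in enumerate(rows):
            dest = os.path.join(out_dir, f"{i:06d}.wav")
            shutil.copyfile(src_wav, dest)
            writer.writerow([dest, os.path.getsize(dest), transcript])

if __name__ == "__main__":
    aggregate(["corpus_a/transcripts.csv", "corpus_b/transcripts.csv"], "aggregated")
```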