Support Common Voice dataset out of the box

MycroftAI / mimic2

Text to Speech engine based on the Tacotron architecture, initially implemented by Keith Ito.

Apache License 2.0


Support Common Voice dataset out of the box #44

Closed stefangrotz closed 4 years ago

stefangrotz commented 4 years ago

The Common Voice project is meant for voice recognition, but some donors are recording tens of hours, and it is possible to separate their recordings out of the dataset. If we could train mimic2 on single donors from the Common Voice project, we could create voices in a large number of different languages.

https://voice.mozilla.org/en/datasets

el-tocino commented 4 years ago

You can do this now, if so motivated, without too much difficulty. One problem is that there aren't many samples relative to a typical Tacotron dataset, even from the larger submitters.
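To see how much material the larger submitters actually have, you could count clips per donor. This is only a rough sketch: `top_donors` is a made-up helper name, and it assumes the TSV is tab-separated with a header row and the donor's client_id in the first column.

```shell
# Hypothetical helper: list the donors with the most clips in a Common Voice TSV.
# Assumes a tab-separated file with a header row and client_id in column 1.
top_donors() {
  tsv=$1
  limit=${2:-10}   # number of donors to show, 10 by default
  # Drop the header, keep the client_id column, count duplicates, largest first.
  tail -n +2 "$tsv" | cut -f1 | sort | uniq -c | sort -rn | head -n "$limit"
}
```

Something like `top_donors validated.tsv 5` would then print a clip count followed by the client_id for the five most active donors.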

stefangrotz commented 4 years ago

I will try it. The sound files are MP3s. Will this be a problem?

el-tocino commented 4 years ago

No, just convert them to WAV with sox or ffmpeg.
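For the ffmpeg route, a minimal sketch could look like the following. `mp3_to_wav` is a hypothetical helper name, and the mono, 16-bit, 16 kHz target is an assumption; use whatever sample rate your training configuration expects.

```shell
# Hypothetical helper: convert one MP3 clip to a mono 16-bit PCM WAV with ffmpeg.
# The 16 kHz sample rate is an assumption, not a mimic2 requirement.
mp3_to_wav() {
  src=$1
  dest_dir=$2
  name=$(basename "$src" .mp3)   # clips/foo.mp3 -> foo
  mkdir -p "$dest_dir"
  # -nostdin keeps ffmpeg from swallowing input when called inside a read loop.
  ffmpeg -nostdin -loglevel error -y -i "$src" \
    -ac 1 -ar 16000 -c:a pcm_s16le "$dest_dir/$name.wav"
}
```

A whole directory could then be converted with `for f in clips/*.mp3; do mp3_to_wav "$f" wavs; done`.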

el-tocino commented 4 years ago

One more thing I neglected to mention: the quality of the data for 99% of the clips is poor. I don't have links handy, but if you review what other Tacotron repos say about what makes a good dataset, clean recordings are one of the big things. The majority of the CV clips are not great in that regard. Dirty data results in poor alignment or poor synthesis.

If you want to try building your own voice, look into the LJ Speech or Nancy datasets, or compile your own from clean recordings.

el-tocino commented 4 years ago

This should get you pretty close

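# Set CLIENT_ID to the donor's full client_id first ($UID is reserved by bash for your numeric user ID).
# Assumes tab-separated columns: client_id, path, sentence, up_votes, down_votes.
# Keep this donor's clips that have zero down-votes, written as path|sentence.
awk -F'\t' -v id="$CLIENT_ID" '$1 == id && $5 == 0 { print $2 "|" $3 }' validated.tsv > my-training-sentences.out
mkdir -p wavs
while IFS='|' read -r clip text; do
  # Convert to mono, 16-bit, 16 kHz WAV (sox needs MP3 support to read the clips).
  sox "clips/$clip" -c 1 -b 16 -r 16000 "wavs/$clip.wav"
  # Strip punctuation, then end the sentence with a single period.
  ts=$(printf '%s' "$text" | sed -e 's/[[:punct:]]//g')
  echo "$clip.wav|$ts.|$ts." >> sentences.txt
done < my-training-sentences.out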
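Before training, it may be worth checking that every row of the resulting metadata file points at a real audio file and has a transcript. The sketch below is an assumption-laden example, not part of mimic2: `check_metadata` is a hypothetical helper, and it expects the `file|text|text` layout with the WAV files in a separate directory.

```shell
# Hypothetical sanity check for a "file|text|text" metadata file.
# Prints the number of problems found; details go to stderr.
check_metadata() {
  meta=$1
  wav_dir=$2
  problems=0
  while IFS='|' read -r wav text rest; do
    # -s is true only for files that exist and are not empty.
    if [ ! -s "$wav_dir/$wav" ]; then
      echo "missing or empty audio: $wav" >&2
      problems=$((problems + 1))
    fi
    # A lone "." means the sentence was empty after punctuation stripping.
    if [ -z "$text" ] || [ "$text" = "." ]; then
      echo "empty transcript: $wav" >&2
      problems=$((problems + 1))
    fi
  done < "$meta"
  echo "$problems"
}
```

For example, `check_metadata sentences.txt wavs` prints `0` when nothing looks wrong.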

stefangrotz commented 4 years ago

That's great, thanks.

I know that CV is not ideal, but I want to experiment with different languages, and there are very few other suitable datasets available for that right now. For example, it would be very interesting to see whether the completely regular constructed language Esperanto gives better results than natural languages, because it has no irregularities or exceptions that add complexity. Or maybe this doesn't have much impact and the natural voice itself is the hard thing to create. I just want to experiment a little with this project to understand it better, and for that purpose the quality will be enough.

I will write about my results here. Maybe this will become useful for others in the future, once the CV dataset has grown. A few donors really do use good microphones, but we will see how well this actually works. There are already a few mass donors with good microphones in the German dataset.

el-tocino commented 4 years ago

You should also check the issues in https://github.com/keithito/tacotron and https://github.com/Rayhane-mamah/Tacotron-2 (and any other Tacotron repo out there) to see what others' experiences have been. https://discourse.mozilla.org/c/tts and https://discourse.mozilla.org/c/voice are also of note.

el-tocino commented 4 years ago

Definitely check out this post: https://discourse.mozilla.org/t/data-and-training-considerations-to-improve-voice-naturalness/45313/21