NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Problem replicating Tacotron 2 recipe for other language pairs #225

Closed nadirdurrani closed 4 years ago

nadirdurrani commented 5 years ago

We are trying to replicate your English results using tacotron2 for Arabic. While we were able to replicate the English results seamlessly, we haven't been successful with Arabic. Because our data is not very clean, we recently tried German, using the data made available here:

https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/

We only used Angela Merkel's data. After a week of training, we still haven't gotten results as good as those for English. We have no prior experience with TTS, so we could be making a very silly mistake. I will list the steps we took to adapt the recipe to German and hope that you can point out what the problem is.

1) We modified "symbols.py" inside text/ to replace the English character set with German. We did not include "_arpabet" when exporting the symbols. Our file looks like:

    _pad = '_'
    _punctuation = ',;:!?/.‘’“()[]\&̈–'
    _special = '-'
    _letters = 'aAáäÄbBcCćçdDeEéêfFgGhHiIïjJkKlLmMnNoOôöÖpPqQrRřsSśßtTuUüÜvVwWxXyYzZ'
    _digits = '0123456789'

Export all symbols:

    symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + list(_digits)
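One thing worth checking: the export above only behaves correctly if every symbol is unique, since text/__init__.py builds a symbol-to-id dictionary from this list and duplicates would silently collapse two characters onto one embedding. A minimal sanity-check sketch (using a simplified ASCII punctuation set as a stand-in for the one quoted above):

```python
# Sketch of a German symbols.py; the punctuation set here is a simplified
# stand-in for the one quoted in the issue.
_pad = '_'
_special = '-'
_punctuation = ',;:!?/.()[]'
_letters = 'aAäÄbBcCdDeEfFgGhHiIjJkKlLmMnNoOöÖpPqQrRsSßtTuUüÜvVwWxXyYzZ'
_digits = '0123456789'

symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + list(_digits)

# Duplicates would break the symbol-to-id mapping in text/__init__.py:
assert len(symbols) == len(set(symbols)), "duplicate symbols in the set"
```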

2) We modified cleaners.py to add a new routine:

    def german_utf8_cleaners(text):
        text = collapse_whitespace(text)
        return text

This is similar to "basic_cleaners" except that we also don't lowercase the text.
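Spelled out with the whitespace helper it depends on, the cleaner is self-contained; this is a sketch that assumes collapse_whitespace is the usual regex helper from the repository's cleaners.py:

```python
import re

# Same whitespace helper as in cleaners.py (assumed; \s+ collapsed to one space).
_whitespace_re = re.compile(r'\s+')

def collapse_whitespace(text):
    return re.sub(_whitespace_re, ' ', text)

def german_utf8_cleaners(text):
    # Like basic_cleaners but without lowercasing or ASCII transliteration,
    # so umlauts and case survive for the German symbol set.
    return collapse_whitespace(text)

print(german_utf8_cleaners('Guten  Morgen,\n Frau  Merkel!'))
# → Guten Morgen, Frau Merkel!
```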

3) We commented out ignore_layers=['embedding.weight'] in hparams.py, as we are training from scratch.

4) We changed the sampling rate to 16000 to match the data.

Any advice on how to make this work would be very helpful. Are the parameters tuned for English, so that we need to experiment to make them work for other languages?

vikrantsharma7 commented 5 years ago
  1. The symbols look fine and should work.
  2. The cleaner looks fine.
  3. ignore_layers is only used when you train with --warm_start, so you don't need to comment it out. In fact, using it to warm-start from the pretrained English model can help the model converge faster; check the README.
  4. If you change the sampling rate, you will also have to adjust hop_length and win_length. Most importantly, you will not be able to use the pretrained WaveGlow model; you will need to train one from scratch using the exact same audio parameters as Tacotron 2. I would suggest using 22.05 kHz audio files and leaving the audio parameters as they are; that lets you use the pretrained WaveGlow vocoder later on, which improves audio quality dramatically.
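For reference, the repository's 22.05 kHz defaults are hop_length=256 and win_length=1024. Keeping the same frame shift and window size in milliseconds when moving to 16 kHz gives roughly the values below; the round numbers 200 and 800 suggested later in the thread correspond to a clean 12.5 ms / 50 ms at 16 kHz. A small sketch of that scaling:

```python
def scale_stft_params(hop, win, sr_old=22050, sr_new=16000):
    # Preserve the frame shift and window size in milliseconds when the
    # sampling rate changes: samples scale linearly with the rate.
    ratio = sr_new / sr_old
    return round(hop * ratio), round(win * ratio)

hop, win = scale_stft_params(256, 1024)
print(hop, win)  # → 186 743

# 200 / 800 samples at 16 kHz are the nearest round values:
print(200 / 16000 * 1000, 800 / 16000 * 1000)  # → 12.5 50.0 (ms)
```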
nadirdurrani commented 5 years ago

Thank you Vikrant !!!

Unfortunately, all the data we have is 16K. We did try to upsample it to 22K, but that did not work. We also tried downsampling English data to 16K, and that still trains fine, although the quality deteriorates a little. So I am not sure this is the actual problem.
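If the upsampling is attempted again, one easy mistake is getting the rate ratio slightly wrong. A minimal numpy sketch of a 16 kHz → 22.05 kHz resampler, for illustration only; a windowed polyphase filter (e.g. scipy.signal.resample_poly with up=441, down=320) is the better choice for real training data:

```python
import numpy as np

def upsample_16k_to_22k(audio, sr_in=16000, sr_out=22050):
    # Linear-interpolation resampler (illustrative only): evaluate the input
    # signal at the output sample grid expressed in input-sample units.
    n_out = int(len(audio) * sr_out / sr_in)
    t_out = np.arange(n_out) * (sr_in / sr_out)
    return np.interp(t_out, np.arange(len(audio)), audio)

one_second = np.zeros(16000)
print(len(upsample_16k_to_22k(one_second)))  # → 22050 (1 s stays 1 s)
```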

Any suggestion on what hop_length and win_length to try with the 16K sampling size?

We are training a WaveGlow model on the side every time, because we thought we had to train it from scratch whenever we train a TTS model on new data. Are you suggesting that the WaveGlow model is language-, speaker- and data-independent, and that we can even use the English one?

One more question: how much data do we need to train Tacotron 2?

vikrantsharma7 commented 5 years ago

That is a bit strange, since I've had some success with files upsampled from 16k to 22k. 200 and 800 would be okay for hop and win. In my experience, the published WaveGlow model works pretty well for languages other than English, and even for male voices. You could try running a few (22k) audio files through mel2samp and inferring with the pretrained model to check how they sound. I have no conclusive answer to your last question, though. At least 3-4 hours of good-quality data should give acceptable results, but it also depends heavily on the speaker, and you may need more.
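As a quick way to check whether a corpus clears that rough 3-4 hour bar, the total duration of a directory of WAV files can be tallied with the standard library alone; a sketch, assuming uncompressed PCM WAV files:

```python
import glob
import os
import wave

def total_hours(wav_dir):
    # Sum the duration (in hours) of every .wav file under wav_dir,
    # using each file's own header for frame count and sampling rate.
    seconds = 0.0
    for path in glob.glob(os.path.join(wav_dir, '**', '*.wav'), recursive=True):
        with wave.open(path, 'rb') as w:
            seconds += w.getnframes() / w.getframerate()
    return seconds / 3600
```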

nadirdurrani commented 5 years ago

Hi Vikrant,

I tried 16K (and upsampling to 22K) with the different hop and win parameters you suggested, but could not get the desired results. I tried German, Arabic and Russian (12+ hours of speech each) but could not make it work.

Even for English, reducing the LJS corpus to 3 hours of speech gave very poor results.

rafaelvalle commented 4 years ago

Closing due to inactivity.