geneing / WaveRNN

Pytorch implementation of Deepmind's WaveRNN model

End-to-end WaveRNN? #3

Open echelon opened 5 years ago

echelon commented 5 years ago

Pardon the questions, but I'm new to machine learning and am trying to familiarize myself with the field.

Is WaveRNN (here and in the original paper) not an end-to-end text to speech system like Tacotron? It looks like the inputs to synthesize.py are the model and a mel spectrogram. If this were incorporated as part of an end-to-end system, would something convert text into a mel spectrogram to feed into the WaveRNN?

I'm interested in WaveRNN because of its promise as a fast CPU-only vocoder. Is this currently the state of the art for synthesis speed? I want to contribute a Rust implementation that can run cheaply on the server.

geneing commented 5 years ago

@echelon WaveRNN is only a vocoder (same as WaveNet, FFTNet, SampleRNN, etc.). It starts with linguistic features (usually Mel spectrograms) and produces sound. The linguistic features may be produced by a different model (e.g. Tacotron-2, Deep Voice 3, etc.). One can combine the two stages into a single network and train it end-to-end, but that is much more challenging than training each stage separately. However, fine-tuning an end-to-end network may be doable and may improve speech quality.
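Roughly, the hand-off looks like this (a toy sketch only; acoustic_model, vocoder, and the shapes are hypothetical stand-ins, not this repo's API):

import numpy as np

def acoustic_model(text):
    # Stand-in for e.g. Tacotron-2: text -> Mel spectrogram frames (80 bands assumed).
    return np.zeros((80, 100), dtype=np.float32)

def vocoder(mel):
    # Stand-in for e.g. WaveRNN: Mel frames -> waveform samples (hop size of 275 assumed).
    return np.zeros(mel.shape[1] * 275, dtype=np.float32)

mel = acoustic_model("hello world")   # stage 1: text -> Mel spectrogram
audio = vocoder(mel)                  # stage 2: Mel spectrogram -> audio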

If you were to train an end-to-end network, you wouldn't use synthesize.py, because you would need to backpropagate errors from the vocoder to the linguistic model during training.

Most of my work is done in this repo: WaveRNN-Pytorch. It contains my changes for reducing model complexity and my C++ inference implementation, which runs in real time on a single CPU core.

I'm also working on combining https://github.com/geneing/Tacotron-2 and https://github.com/geneing/WaveRNN-Pytorch.

TTS and voice generation are among the harder problems in machine learning, so they may not be the best place to start learning the field.

echelon commented 5 years ago

Thanks so much, @geneing ! This really aids my understanding. I know this is a really difficult domain. I’ve previously implemented a performant concatenative text to speech system that leveraged CMU’s ARPABET and unit selection, but the audio quality was dodgy, which is why I’m exploring the ML approach to speech synthesis.

I’ve been playing with Keith Ito’s implementation of Tacotron 2 and have pulled out the Griffin-Lim component:

def inv_spectrogram_tensorflow(spectrogram):
  # Denormalizes, undoes the reference-level dB offset, converts back to amplitude,
  # raises it to hparams.power, and inverts with Griffin-Lim.
  S = _db_to_amp_tensorflow(_denormalize_tensorflow(spectrogram) + hparams.ref_level_db)
  return _griffin_lim_tensorflow(tf.pow(S, hparams.power))

Instead of generating audio from text, I short-circuit the pipeline and write the spectrograms out to image files (by mapping their values into the range [0, 255]). They visibly look like speech spectrograms, so I think I’ve found the correct boundary for the text -> Mel spectrogram piece of the end-to-end synthesis pipeline.
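The dump step itself is just a rescale-and-save, something like the sketch below (the function name and details are made up, not taken from either repo):

import numpy as np
from PIL import Image

def save_spectrogram_image(spectrogram, path):
    # Shift and scale the 2-D float array into [0, 255] so it can be viewed as a grayscale image.
    s = spectrogram - spectrogram.min()
    s = s / max(float(s.max()), 1e-8) * 255.0
    Image.fromarray(s.astype(np.uint8)).save(path)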

Next I’m going to try to shell out to or otherwise call your WaveRNN code to transform these spectrograms into audio. Does any processing or normalization have to happen to the spectrograms generated by the Tacotron-2 pipeline before WaveRNN can handle them? Or will this approach not work out of the box, given that these are two separate models that have not been trained together?

If this works, I’m going to port the operations to Rust to run on the CPU.

(I haven’t gotten this far yet, but do you foresee any blockers preventing the text -> linguistic features stage from being ported to run efficiently on a single-threaded CPU? I don’t see Fourier transforms or other expensive operations like those in Griffin-Lim, but I might be underestimating the matrix math going on here.)

Thanks again for your help!

geneing commented 5 years ago

@echelon I recently spent a week trying to match mel spectrograms between https://github.com/geneing/Tacotron-2 and https://github.com/geneing/WaveRNN-Pytorch. There were a few subtle differences between the two: normalization and scaling, offset, fmin, and the library used to produce the mels. After a week of work I still couldn't match them.
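For illustration only, the kind of mismatch was along these lines; the settings below (fmin, reference and minimum dB levels, clipping) are made-up Tacotron-2-style values, not the actual hparams of either repo:

import numpy as np
import librosa

def mel_pipeline_a(wav, sr=22050):
    # dB-scaled magnitude mel, clipped to [0, 1] (assumed ref_level_db=20, min_level_db=-100).
    m = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256,
                                       n_mels=80, fmin=125, power=1.0)
    db = 20 * np.log10(np.maximum(1e-5, m)) - 20
    return np.clip((db + 100) / 100, 0, 1)

def mel_pipeline_b(wav, sr=22050):
    # Raw power mel with fmin=0: same audio, but a feature the other model can't consume.
    return librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256,
                                          n_mels=80, fmin=0)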

I ended up using the tacotron preprocessing to generate training data for both the tacotron and wavernn training. I then modified the wavernn data loader to work with the tacotron training data. It turned out to be much simpler and much more robust.
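In outline, both trainings then read the same preprocessed files, something like this hypothetical layout (not the actual WaveRNN-Pytorch loader):

import numpy as np
from pathlib import Path

def load_utterance(data_dir, utt_id):
    # Both the tacotron and wavernn training read mels written once by the tacotron
    # preprocessing step, so the feature format is defined in a single place.
    mel = np.load(Path(data_dir) / "mels" / f"{utt_id}.npy")
    wav = np.load(Path(data_dir) / "audio" / f"{utt_id}.npy")
    return mel, wav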

I haven't benchmarked tacotron synthesis on the CPU. On one hand, tacotron generates linguistic features at a rate of 80 frames per second of speech, vs 16K samples per second for the vocoder. On the other hand, tacotron has ~30M parameters, vs ~0.5M in my reduced vocoder. However, this doesn't say much about the actual speed.
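As a back-of-envelope comparison only, assuming each output step touches every parameter roughly once (which ignores attention, the encoder running once per utterance, and memory effects, so it says little about real speed):

frames_per_sec  = 80        # tacotron mel frames per second of speech
samples_per_sec = 16_000    # vocoder samples per second of speech

tacotron_cost = 30e6 * frames_per_sec    # ~2.4e9 parameter-touches per second of speech
wavernn_cost  = 0.5e6 * samples_per_sec  # ~8.0e9 parameter-touches per second of speech

print(tacotron_cost, wavernn_cost)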