fatchord / WaveRNN

WaveRNN Vocoder + TTS
https://fatchord.github.io/model_outputs/

use wavernn to make TTS #1

Closed maozhiqiang closed 6 years ago

maozhiqiang commented 6 years ago

Can we use WaveRNN to make TTS? What is the input? Thank you.

fatchord commented 6 years ago

Hi there, I believe you can, in the paper that's what they said they did with it. They are using linguistic/pitch features as conditioning to wavernn.

Right now I'm trying to condition with upsampled mel spectrograms (like the tacotron2 vocoder) but the sound quality isn't great - I'm getting a whispering/gravelly sound, like the speaker has a really bad cold! I still have a couple more ideas to test out so if I have any success I'll upload it to the repo.

maozhiqiang commented 6 years ago

@fatchord Thank you for your quick reply. There are some points in the paper that I haven't fully understood, so I'm using your code to understand them!

fatchord commented 6 years ago

@maozhiqiang You're very welcome. Please keep in mind that there are some details (optimizer, learning rate, conditioning equations, signal splitting technique etc etc) that were left out of the paper so I've had to basically guess at what these are and improvise accordingly.

maozhiqiang commented 6 years ago

@fatchord Thank you for your detailed explanation!

naifrec commented 6 years ago

hi @fatchord! thanks for your implementation, it brings some clarity to equation (2) in the paper. There are a few things I am not sure I understand. First, it is stated that there are N = 5 matrix multiplications, but I count 6:

  1. I x_t
  2. R h_{t-1}
  3. O_1 y_c
  4. O_2 relu(O_1 y_c)
  5. O_3 y_f
  6. O_4 relu(O_3 y_f)

Now, talking about conditioning: all the conditioning equations are left out, which feels pretty convenient because it basically allows them to claim N = 5. If you add the conditioning operations, this number should rise to at least 6 (or 7). I cannot see in your repository how you have tackled conditioning. I guess you are using a similar approach to the one in WaveNet, where you have a 1D 1x1 convolution on the conditioning variable and the result is just added to the hidden state of the RNN?
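For concreteness, here is how I read equation (2), with those six multiplications written out (a rough sketch only, not this repo's code; the masking of I for the coarse/fine split is omitted and all names/shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def wavernn_cell_step(x_t, h_prev, p):
    """One step of WaveRNN equation (2), written out naively so the six
    matrix multiplications listed above are visible. The masking that hides
    c_t from the coarse half of the state is omitted for brevity."""
    ix = x_t @ p['I'].T                          # (1) I x_t
    rh = h_prev @ p['R'].T                       # (2) R h_{t-1}
    iu, ir, ie = ix.chunk(3, dim=-1)
    ru, rr, re = rh.chunk(3, dim=-1)

    u = torch.sigmoid(ru + iu)                   # update gate
    r = torch.sigmoid(rr + ir)                   # reset gate
    e = torch.tanh(r * re + ie)                  # candidate state
    h = u * h_prev + (1.0 - u) * e

    y_c, y_f = h.chunk(2, dim=-1)                # split state into coarse / fine halves
    p_c = F.softmax(F.relu(y_c @ p['O1'].T) @ p['O2'].T, dim=-1)   # (3) + (4)
    p_f = F.softmax(F.relu(y_f @ p['O3'].T) @ p['O4'].T, dim=-1)   # (5) + (6)
    return p_c, p_f, h

# toy shapes: hidden=896, 3-dim input (c_{t-1}, f_{t-1}, c_t), 256-way outputs
hid, inp, fc, classes = 896, 3, 896, 256
params = {'I': torch.randn(3 * hid, inp), 'R': torch.randn(3 * hid, hid),
          'O1': torch.randn(fc, hid // 2), 'O2': torch.randn(classes, fc),
          'O3': torch.randn(fc, hid // 2), 'O4': torch.randn(classes, fc)}
p_c, p_f, h = wavernn_cell_step(torch.zeros(1, inp), torch.zeros(1, hid), params)
```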

fatchord commented 6 years ago

Hi there, I'm not 100% sure but the matrix multiplication for the inputs is quite small compared to the others in the equations so perhaps that's why it's left out of N.

As for conditioning, I've tried a couple of things - I've passed the conditioning features through a dense layer and then biased the gates, and I've also tried concatenating them to the input. Both are working pretty much the same but the results are only okayish - the speech is intelligible but it has this trembling quality, which might point to my upsampling network being incorrect. Have a listen: wavernn_conditioned.tar.gz

I'm currently training another model with a different upsampling network & a bigger dataset. If the results are good I'll upload it to the repo.
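For reference, the two conditioning sites I mentioned look roughly like this (a hypothetical sketch, names illustrative, not the exact code here):

```python
import torch
import torch.nn as nn

class Conditioner(nn.Module):
    """Two possible conditioning sites: 'concat' appends the local features
    to the sample input; 'bias_gates' projects them with a dense layer so
    the result can be added to the u/r/e gate pre-activations in the cell."""
    def __init__(self, n_mels=80, hidden=896, mode='concat'):
        super().__init__()
        self.mode = mode
        if mode == 'bias_gates':
            self.to_gates = nn.Linear(n_mels, 3 * hidden)

    def forward(self, x_t, mel_t):
        # x_t: (batch, input_dim) samples, mel_t: (batch, n_mels) upsampled frame
        if self.mode == 'concat':
            return torch.cat([x_t, mel_t], dim=-1), None
        return x_t, self.to_gates(mel_t)   # bias to add to the gate pre-activations

x_t, mel_t = torch.zeros(4, 3), torch.zeros(4, 80)
cat_in, _ = Conditioner(mode='concat')(x_t, mel_t)            # shape (4, 83)
_, gate_bias = Conditioner(mode='bias_gates')(x_t, mel_t)     # shape (4, 2688)
```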

lifeiteng commented 6 years ago

@naifrec When sampling, 1 & 2 can be merged, e.g. [R I] * [h_{t-1}; x_t]
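Just to illustrate the block-matrix identity (a toy sketch, not code from this repo):

```python
import torch

# Merging the recurrent and input matmuls from steps 1 & 2 at sampling time:
# a single (3*hidden x (hidden + input)) matrix applied to the concatenated
# vector gives the same result as R @ h + I @ x.
hidden, inp = 896, 3
R, I = torch.randn(3 * hidden, hidden), torch.randn(3 * hidden, inp)
h, x = torch.randn(hidden), torch.randn(inp)
RI = torch.cat([R, I], dim=1)                 # [R I]
merged = RI @ torch.cat([h, x])               # [R I] [h; x]
assert torch.allclose(merged, R @ h + I @ x, atol=1e-3)
```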

lifeiteng commented 6 years ago

@fatchord Very nice result - do they use ground-truth c_t? The conditioning formula is missing from the paper; maybe we should email the authors.

I concatenate linguistic (green line) / acoustic (red line) features with x_t, training on cmu_us_slt_arctic: [training-curve screenshot]

fatchord commented 6 years ago

@lifeiteng thanks, it's not a terrible result - at least it's working somewhat. However, on the same dataset I get a smoother quality with WaveNet (8-bit): wavenet_gen_glr_dataset.tar.gz

I'm not sure what you mean by ground-truth c_t - do you mean conditioning features at time t?

By the way, I'd love to hear your wavernn samples so feel free to post them here!

lifeiteng commented 6 years ago

@fatchord I mean: are those samples synthesized from acoustic features only, without ground-truth coarse & fine samples as inputs (as during training)?

Doing experiments now.

lifeiteng commented 6 years ago

BTW, you can try 9-bit or 10-bit u-law in WaveNet.
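For reference, a quick sketch of μ-law encode/decode at an arbitrary bit depth (bits=8 is the usual WaveNet setting; 9 or 10 as suggested above; illustrative code, not from any particular repo):

```python
import numpy as np

def mu_law_encode(x, bits=10):
    """Mu-law compand a [-1, 1] waveform and quantize it to 2**bits classes."""
    mu = 2 ** bits - 1
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)   # class ids in [0, mu]

def mu_law_decode(y, bits=10):
    """Invert the companding back to a [-1, 1] float waveform."""
    mu = 2 ** bits - 1
    companded = 2 * (y.astype(np.float64) / mu) - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu

wave = np.sin(np.linspace(0, 8 * np.pi, 1000))
ids = mu_law_encode(wave, bits=10)          # 1024-way targets for the softmax
recon = mu_law_decode(ids, bits=10)
```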

maozhiqiang commented 6 years ago

@fatchord hi! For mel spectrograms as a local condition, do you upsample them? How do you feed them as input to WaveRNN?

fatchord commented 6 years ago

@lifeiteng Re: generation - thanks for clarifying - those are synthesized with mel spectrograms only; the model didn't see any ground-truth audio samples when generating.

Re: 9/10 bit wavenet - actually I've been wanting to try this out but my gpu has been busy with other experiments. Did you try it? Any success with it?

fatchord commented 6 years ago

@maozhiqiang I upsampled with a 2d conv transpose layer to keep the channels separate. Have a look at how kan-bayashi does it in his wavenet repo - it's basically the same. Regarding input/conditioning, you have a lot of options: you can concatenate it to the coarse and fine samples, you can transform it with a dense layer and split it to bias the gates, and another option I guess is to transform it and add it to the hidden state after the gates have been computed. There might be other ways but these are the most obvious to me right now.
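Roughly, the upsampling idea looks like this (a minimal sketch, assuming hop_length=256 and 80 mel bands; module and parameter names are illustrative, not the repo's actual code):

```python
import torch
import torch.nn as nn

class MelUpsampler(nn.Module):
    """Upsample a mel spectrogram to sample rate with a single 2D transposed
    conv, keeping mel channels separate (single in/out channel, the kernel
    spans the time axis only). Hyperparameters are illustrative."""
    def __init__(self, hop_length=256):
        super().__init__()
        # kernel/stride chosen so T frames -> T * hop_length samples
        self.upsample = nn.ConvTranspose2d(
            1, 1,
            kernel_size=(1, 2 * hop_length),
            stride=(1, hop_length),
            padding=(0, hop_length // 2))

    def forward(self, mels):
        x = mels.unsqueeze(1)        # (batch, n_mels, T) -> (batch, 1, n_mels, T)
        x = self.upsample(x)         # (batch, 1, n_mels, T * hop_length)
        return x.squeeze(1)          # (batch, n_mels, T * hop_length)

# usage: one conditioning vector per audio sample
mels = torch.randn(2, 80, 40)                    # 40 frames of 80-band mels
cond = MelUpsampler(hop_length=256)(mels)
print(cond.shape)                                # torch.Size([2, 80, 10240])
```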

lifeiteng commented 6 years ago

@fatchord WaveNet - I tried 9-bit (i.e. a 512-way softmax) in WaveNet and got a much cleaner result than the 8-bit version. I recently saw a paper which claims 10-bit is better than 8-bit.

Now I like WaveRNN more, because training WaveNet needs several days of GPU time to get a reasonable result.

maozhiqiang commented 6 years ago

@fatchord thanks! I will implement this like you said!

fatchord commented 6 years ago

@lifeiteng Sounds very interesting, have you got a link to the paper?

lifeiteng commented 6 years ago

@fatchord http://festvox.org/blizzard/bc2017/USTC-NELSLIP_Blizzard2017.pdf

> As the original 8bit quantization introduced quantization noise in synthetic speeches, we proposed to use a 10bit quantization scheme instead, in order to alleviate this problem. WaveNet with 3 blocks, which was 30 layers in total, was used in our system.

fatchord commented 6 years ago

@lifeiteng Thanks! Nice read, I really love the idea of using a GAN to fix the smoothness from the MSE loss.

shreyasnivas commented 6 years ago

Hi @fatchord - you've got some nice quality results there both from wavernn and wavenet!

Did you manage to overcome that trembling quality with your WaveRNN 198k sample? It's something we've been puzzling over too...

Our intuition points to the following areas:

  1. The continuity of the RNN is broken by re-initializing the hidden state after 960 samples (which would point to a periodic tremble)
  2. The upsampling network may produce artifacts
  3. experimenting with different conditioning sites (i.e. dense layer / concat at gates)
  4. If all else fails, more data..?

To preserve continuity, we are looking into TBPTT and longer input sequence lengths (like 2000 instead of 960), as well as increasing the length of our batches from the dataloader so we can carry the hidden states forward for longer on each step.
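Roughly, the state-carrying part looks like this (a minimal sketch, assuming the model returns (output, new_hidden); all names are illustrative):

```python
def tbptt_run(model, loss_fn, optimizer, chunks, h):
    """Truncated BPTT over consecutive chunks of one long utterance:
    carry the hidden state forward between chunks, but detach it so that
    gradients only flow within each chunk."""
    for x, y in chunks:              # chunks must arrive in temporal order
        h = h.detach()               # keep the value, cut the autograd graph
        out, h = model(x, h)
        loss = loss_fn(out, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return h
```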

Our upsampling is the same conv2d approach as in kan-bayashi's wavenet repo (we tried different stride/conv2d settings, like stacking smaller upsampling convs etc., but found that the single 2d conv that upsamples the mel spectrograms 'hop_length' times works best).

We're exploring different conditioning sites now.

The last approach is to train it on a huge dataset like LJSpeech and just wait it out, but our initial overfitting attempts on WaveRNN didn't seem to get rid of the jittery/trembling quality...

Did you find anything that worked? i.e wavernn comparable to wavenet quality?

cheers, shreyas

MlWoo commented 6 years ago

@lifeiteng Did you message the authors to make sense of the conditioning formula?

lifeiteng commented 6 years ago

@MlWoo Yes, but I haven't got a response.

MlWoo commented 6 years ago

@lifeiteng Thanks a lot. The curves you show look good. Did you get them by first upsampling the conditioning features to match the time resolution and then projecting them to bias the gates?

lifeiteng commented 6 years ago

Those curves came from a data feeding bug. Actually, I'm not having much luck with WaveRNN.

MlWoo commented 6 years ago

@lifeiteng what a pity!

jjj8080 commented 6 years ago

@fatchord Is there any chance you could upload the code version you used to condition on mel-spectrograms?

tiberiu44 commented 6 years ago

Hi guys,

I've managed to implement a modified version of WaveRNN (I use two parallel RNN layers instead of modifying the cell) and a slightly modified version of Tacotron to output conditioned spectrograms. I tested the approach on Romanian (directly on characters) and got some pretty OK results. You can check them out here (I also added the results obtained with HTS trained on the same corpus, with lots of features extracted from text - syllables, phonetic transcription, POS, chunking etc.):

https://tiberiu44.github.io/TTS-Cube/

the project's repo is https://github.com/tiberiu44/TTS-Cube

I'm currently working on adding some more text-features (unsupervised) and on conditioning the model on multiple speakers.

Hope this helps, Tibi

MlWoo commented 6 years ago

@tiberiu44 Thanks a lot for your contribution. I have tried to train the vocoder with your code, but it is very slow to train on a 1080 Ti. I want to confirm this because I am not familiar with the DyNet framework.

tiberiu44 commented 6 years ago

Hi @MlWoo

It is slow, but it might be slower if you don't use the GPU version of DyNet. See this issue: https://github.com/tiberiu44/TTS-Cube/issues/2

You need to compile DyNet with CUDA support. Also run the training process for the vocoder with --use-gpu --autobatch and --batch-size=4000.

Similar settings go for the encoder (it ignores the batch-size at this point).

MlWoo commented 6 years ago

@tiberiu44 I did compile DyNet with CUDA and ran the model with the CUDA backend. I haven't checked your model yet; maybe the model is just large. Thanks for your reply.