Hi there, I believe you can; that's what they said they did in the paper. They use linguistic/pitch features as conditioning for WaveRNN.
Right now I'm trying to condition with upsampled mel spectrograms (like the tacotron2 vocoder) but the sound quality isn't great - I'm getting a whispering/gravelly sound, like the speaker has a really bad cold! I still have a couple more ideas to test out so if I have any success I'll upload it to the repo.
@fatchord Thank you for your quick reply. There are some points in the paper that I haven't fully understood, so I'm using your code to understand them!
@maozhiqiang You're very welcome. Please keep in mind that there are some details (optimizer, learning rate, conditioning equations, signal splitting technique etc etc) that were left out of the paper so I've had to basically guess at what these are and improvise accordingly.
@fatchord Thank you for your detailed explanation!
Hi @fatchord! Thanks for your implementation, it brings some clarity to equation (2) in the paper. There are a few things I'm not sure I understand. First, it is stated that there are N=5 matrix multiplications, but I count 6 matrix multiplications.
Now, talking about conditioning: all the conditioning equations are left out, which feels pretty convenient because it basically allows them to claim N=5. If you add the conditioning operations, this number should rise to at least 6 (or 7). I cannot see in your repository how you have tackled conditioning. I guess you are using a similar approach to the one in WaveNet, where you have a 1D 1x1 convolution on the conditioning variable and the result is just added to the hidden state of the RNN?
Hi there, I'm not 100% sure but the matrix multiplication for the inputs is quite small compared to the others in the equations so perhaps that's why it's left out of N.
As for conditioning, I've tried a couple of things - I've passed the conditioning features through a dense layer and then biased the gates. I've also tried concatenating them to the input. Both are working pretty much the same but the results are only okayish - the speech is intelligible but it has this trembling quality, which might point to my upsampling network being incorrect. Have a listen: wavernn_conditioned.tar.gz
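Roughly, the gate-biasing variant looks like this (a minimal sketch, not the exact code in the repo; `CondGRUCell` and all the sizes are just illustrative):

```python
import torch
import torch.nn as nn

class CondGRUCell(nn.Module):
    """Sketch of gate biasing: the conditioning vector goes through a dense
    layer and the result is split into one bias per gate pre-activation."""
    def __init__(self, input_size, hidden_size, cond_size):
        super().__init__()
        self.ih = nn.Linear(input_size, 3 * hidden_size)    # I* matrices
        self.hh = nn.Linear(hidden_size, 3 * hidden_size)   # R matrix
        self.cond = nn.Linear(cond_size, 3 * hidden_size)   # conditioning projection

    def forward(self, x, c, h):
        ih_u, ih_r, ih_e = self.ih(x).chunk(3, dim=-1)
        hh_u, hh_r, hh_e = self.hh(h).chunk(3, dim=-1)
        c_u, c_r, c_e = self.cond(c).chunk(3, dim=-1)
        u = torch.sigmoid(ih_u + hh_u + c_u)                # update gate
        r = torch.sigmoid(ih_r + hh_r + c_r)                # reset gate
        e = torch.tanh(r * hh_e + ih_e + c_e)               # candidate state
        return u * h + (1.0 - u) * e                        # next hidden state
```

The concatenation variant is simpler: just `torch.cat([x, c], dim=-1)` fed into a plain `nn.GRUCell(input_size + cond_size, hidden_size)`.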
I'm currently training another model with different upsampling network & bigger dataset. If results are good I'll upload to the repo.
@naifrec When sampling, 1 & 2 can be merged, e.g. [R I] * [h_{t-1}; x_t]
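A quick numerical check of that identity (purely illustrative, not from any repo): stacking the two weight matrices and concatenating the two vectors gives the same result as two separate multiplies.

```python
import torch

hidden, inp = 896, 3
R = torch.randn(hidden, hidden)   # recurrent weights
I = torch.randn(hidden, inp)      # input weights
h_prev = torch.randn(hidden)
x_t = torch.randn(inp)

separate = R @ h_prev + I @ x_t
merged = torch.cat([R, I], dim=1) @ torch.cat([h_prev, x_t])
print(torch.allclose(separate, merged, atol=1e-4))  # True
```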
@fatchord Very nice result. Do they use ground-truth c_t?
The conditioning formula is missing from the paper; maybe we should email the authors.
I concatenate linguistic (green line) / acoustic (red line) features with x_t, training on cmu_us_slt_arctic:
@lifeiteng Thanks, it's not a terrible result, at least it's working somewhat. However, on the same dataset I get a smoother quality with WaveNet (8-bit): wavenet_gen_glr_dataset.tar.gz
I'm not sure what you mean by ground-truth c_t - do you mean the conditioning features at time t?
By the way, I'd love to hear your wavernn samples so feel free to post them here!
@fatchord I mean: are those samples synthesized from acoustic features only, without ground-truth coarse & fine as inputs (like during training)?
Doing experiments now.
BTW, you can try 9-bit or 10-bit µ-law in WaveNet.
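For reference, a minimal µ-law encode/decode sketch where the bit depth is a parameter, so 8/9/10-bit are just different `bits` values (illustrative, not taken from any of the repos mentioned here):

```python
import numpy as np

def mulaw_encode(x, bits=9):
    """Map a waveform in [-1, 1] to integer classes (bits=8/9/10 -> 256/512/1024 classes)."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.clip((y + 1.0) / 2.0 * mu + 0.5, 0, mu).astype(np.int64)

def mulaw_decode(q, bits=9):
    """Inverse mapping from integer classes back to a waveform in [-1, 1]."""
    mu = 2 ** bits - 1
    y = 2.0 * q.astype(np.float64) / mu - 1.0
    return np.sign(y) / mu * ((1.0 + mu) ** np.abs(y) - 1.0)
```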
@fatchord Hi! For mel as a local condition, do you upsample it? And how do you feed it as input to WaveRNN?
@lifeiteng Re: generation - thanks for clarifying - those are synthesized with mel spectrograms only; the model didn't see any ground-truth audio samples when generating.
Re: 9/10 bit wavenet - actually I've been wanting to try this out but my gpu has been busy with other experiments. Did you try it? Any success with it?
@maozhiqiang I upsampled with a 2D conv-transpose layer to keep the channels separate. Have a look at how kan-bayashi does it in his WaveNet repo - it's basically the same. Regarding input/conditioning, you have a lot of options: you can concatenate it to the coarse and fine samples, you can transform it with a dense layer, split it and bias the gates, and another option, I guess, is to transform it and add it to the hidden state after the gates have been computed. There might be other ways but these are the most obvious to me right now.
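Something along these lines (a rough sketch of the 2D conv-transpose upsampling, not the exact code from either repo; `hop_length` and the kernel size are illustrative):

```python
import torch
import torch.nn as nn

class MelUpsampler(nn.Module):
    """Sketch: stretch mel frames along time by hop_length with a 2D transposed conv.
    Treating the mel bins as the 'height' of a 1-channel image keeps them separate."""
    def __init__(self, hop_length=256):
        super().__init__()
        self.up = nn.ConvTranspose2d(
            1, 1,
            kernel_size=(1, 2 * hop_length),
            stride=(1, hop_length),
            padding=(0, hop_length // 2))

    def forward(self, mels):              # mels: (batch, n_mels, frames)
        x = mels.unsqueeze(1)             # -> (batch, 1, n_mels, frames)
        x = self.up(x)                    # -> (batch, 1, n_mels, frames * hop_length)
        return x.squeeze(1)               # -> (batch, n_mels, samples)
```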
@fatchord WaveNet - I tried 9-bit (i.e. a 512-way softmax) in WaveNet and got a much cleaner result than the 8-bit version. I recently saw a paper which claims 10-bit is better than 8-bit.
Now I like WaveRNN more, because training WaveNet needs several days of GPU time to get a reasonable result.
@fatchord Thanks! I will implement this like you said!
@lifeiteng Sounds very interesting, have you got a link to the paper?
@fatchord http://festvox.org/blizzard/bc2017/USTC-NELSLIP_Blizzard2017.pdf
"As the original 8bit quantization introduced quantization noise in synthetic speeches, we proposed to use a 10bit quantization scheme instead, in order to alleviate this problem. WaveNet with 3 blocks, which was 30 layers in total, was used in our system."
@lifeiteng Thanks! Nice read, I really love the idea of using a GAN to fix the smoothness from the mse loss.
Hi @fatchord - you've got some nice quality results there both from wavernn and wavenet!
Did you manage to overcome that trembling quality with your wavernn 198k sample? It's something we've been puzzling over too...
Our intuition points to the following areas:
- To preserve continuity, we are looking into TBPTT and longer input sequence lengths (like 2000 instead of 960), as well as increasing the length of our batches from the dataloader so we can carry the hidden states forward longer on each step (see the sketch after this list).
- Our upsampling is the same conv2d approach as in kan-bayashi's WaveNet. (We tried different stride/conv2d settings, like stacking smaller upsampling convs etc., but found that the single 2D conv that upsamples the mel spectrograms 'hop_length' times works best.)
- We're exploring different conditioning sites now.
- The last approach is to train on a huge dataset like LJSpeech and just wait it out, but our initial overfitting attempts on wavernn didn't seem to get rid of the jittery/trembling quality...
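For the first point, this is roughly what we mean by carrying the hidden state forward under TBPTT (just a sketch; the `model(x, cond, h)` signature and variable names are hypothetical):

```python
import torch
import torch.nn.functional as F

def tbptt_epoch(model, optimizer, chunks, h):
    """Sketch of truncated BPTT: `chunks` are consecutive slices of the same
    utterances, so the hidden state is carried over but detached between steps."""
    for x, y, cond in chunks:
        optimizer.zero_grad()
        out, h = model(x, cond, h)        # hypothetical signature: logits + new state
        loss = F.cross_entropy(out.transpose(1, 2), y)
        loss.backward()
        optimizer.step()
        h = h.detach()                    # truncate the gradient, keep the state
    return h
```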
Did you find anything that worked, i.e. WaveRNN quality comparable to WaveNet?
cheers, shreyas
@lifeiteng Did you message the authors to make sense of the conditioning formula?
@MlWoo Yes, but I did not get a response.
@lifeiteng Thanks a lot. The curves you show look good. Did you get them by upsampling the conditioning features first to adapt the time resolution, and then projecting them to bias the gates?
That's due to a data-feeding bug. Actually, I haven't been lucky with WaveRNN.
@lifeiteng what a pity!
@fatchord Is there any chance you could upload the code version you used to condition on mel-spectrograms?
Hi guys,
I've managed to implement a modified version of WaveRNN (I use two parallel RNN layers instead of modifying the cell) and a slightly modified version of Tacotron to output the conditioning spectrograms. I tested the approach on Romanian (directly on characters) and I got some pretty OK results. You can check them out here (I also added the results obtained with HTS trained on the same corpus, with lots of features extracted from text: syllables, phonetic transcription, POS, chunking etc.):
https://tiberiu44.github.io/TTS-Cube/
the project's repo is https://github.com/tiberiu44/TTS-Cube
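In case it helps, the "two parallel RNN layers" idea boils down to something like this (only a sketch of the concept, not the actual TTS-Cube code; all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SplitWaveRNN(nn.Module):
    """Sketch: one GRU predicts the coarse byte, a second GRU predicts the fine
    byte (which also sees the current coarse value), instead of splitting one cell."""
    def __init__(self, cond_size, hidden=512, classes=256):
        super().__init__()
        self.coarse_rnn = nn.GRU(2 + cond_size, hidden, batch_first=True)
        self.fine_rnn = nn.GRU(3 + cond_size, hidden, batch_first=True)
        self.coarse_out = nn.Linear(hidden, classes)
        self.fine_out = nn.Linear(hidden, classes)

    def forward(self, coarse_prev, fine_prev, coarse_cur, cond):
        # inputs: (batch, time, 1) sample streams and (batch, time, cond_size) features
        xc = torch.cat([coarse_prev, fine_prev, cond], dim=-1)
        xf = torch.cat([coarse_prev, fine_prev, coarse_cur, cond], dim=-1)
        hc, _ = self.coarse_rnn(xc)
        hf, _ = self.fine_rnn(xf)
        return self.coarse_out(hc), self.fine_out(hf)
```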
I'm currently working on adding some more text-features (unsupervised) and on conditioning the model on multiple speakers.
Hope this helps, Tibi
@tiberiu44 Thanks a lot for your contribution. I have tried to train the vocoder with your code, but it is very slow on a 1080 Ti. I want to confirm this because I am not familiar with the DyNet framework.
Hi @MlWoo
It is slow, but it might be even slower if you don't use the GPU version of DyNet. See this issue: https://github.com/tiberiu44/TTS-Cube/issues/2
You need to compile DyNet with CUDA support. Also run the training process for the vocoder with --use-gpu --autobatch and --batch-size=4000.
Similar settings go for the encoder (it ignores the batch-size at this point).
@tiberiu44 I did compile DyNet with CUDA and ran the model with the CUDA backend. I have not inspected your model yet; maybe the model is just large. Thanks for your reply.
Can we use WaveRNN to build a TTS system? What is the input? Thank you.