G-Wang / WaveRNN-Pytorch

Fatcord's Alternative WaveRNN (Faster training)
MIT License

mu-law and gaussian output tests #2

Open G-Wang opened 6 years ago

G-Wang commented 6 years ago

Since the original code is trained on 9-bit audio (i.e. a 512-way softmax output), there is a lot of background static present, which is unfortunately learned by the model.

Two paths will need to be investigated:

  1. mu-law encoded audio (256-way softmax output); a companding sketch follows this list

  2. Gaussian output, similar to what's discussed in the ClariNet paper. Other distributions with bounded support (e.g. Beta) can also be tested.
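For reference, mu-law companding itself is compact; here is a minimal numpy sketch (the function names are mine, not the repo's):

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """Compand a float waveform in [-1, 1] and quantize to mu + 1 classes."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # still in [-1, 1]
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)  # ints in [0, mu]

def mulaw_decode(y, mu=255):
    """Map class indices back to floats in [-1, 1] and expand."""
    x = 2 * y.astype(np.float64) / mu - 1
    return np.sign(x) * ((1 + mu) ** np.abs(x) - 1) / mu
```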

G-Wang commented 6 years ago

Changed the single Gaussian output to a Beta distribution, which has bounded support and seems to train better. Still need to run experiments to tune hyper-parameters such as the learning-rate schedule and the sequence length (for teacher-forced training), and to make the code more modular.
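For illustration, a Beta output head boils down to something like the sketch below; the parameterization and names are mine, not necessarily what's in this repo's distributions.py:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def beta_loss(params, target):
    """Negative log-likelihood of waveform samples under a Beta output head.

    params: the two unconstrained network outputs per timestep; softplus
    keeps the concentration parameters positive.
    """
    alpha = F.softplus(params[..., 0]) + 1e-4
    beta = F.softplus(params[..., 1]) + 1e-4
    # Beta has support (0, 1): rescale targets from [-1, 1] and keep them
    # away from the endpoints, where the log-density diverges.
    t = ((target + 1) / 2).clamp(1e-5, 1 - 1e-5)
    return -Beta(alpha, beta).log_prob(t).mean()

def beta_sample(params):
    """Draw a waveform sample and rescale it back to [-1, 1]."""
    alpha = F.softplus(params[..., 0]) + 1e-4
    beta = F.softplus(params[..., 1]) + 1e-4
    return Beta(alpha, beta).sample() * 2 - 1
```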

This WaveRNN model trains in about 50 hours at a batch size of 32 on a GTX 1070 Ti (without using all of its memory).

Generation speed is pretty fast (about 2,000 samples/second). This is before any further optimization (e.g. RNN weight pruning). Generation could probably also be sped up on a GPU with more memory by generating different segments of the sentence in parallel (as long as each is longer than the training sequence length), since the model doesn't use much memory (about 1.5 GB of GPU RAM).

preliminary samples: beta_wavernn.tar.gz

hdmjdp commented 5 years ago

Hi, in your experiments, which loss is better? And for the Beta loss, what value does the loss decrease to?

G-Wang commented 5 years ago

Hi, currently I would recommend using the raw bits (9 or 10 bits). These are the easiest and fastest to train. I think audio can sound pretty good with 10-bit raw output, with maybe some post-processing work on the speech. Interestingly, I had more trouble with the mu-law encoded output (I tried mu-law with 256 and 512 output dimensions) than with raw-bit output.
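For context, the "raw bits" targets are just a uniform quantization of the waveform into 2^bits classes; a small sketch with illustrative names:

```python
import numpy as np

def quantize(x, bits=10):
    """Uniformly quantize a [-1, 1] float waveform to 2**bits classes."""
    levels = 2 ** bits
    y = (x + 1.0) / 2.0 * (levels - 1)  # floats in [0, levels - 1]
    return np.rint(y).astype(np.int64)

def dequantize(y, bits=10):
    """Map class indices back to floats in [-1, 1]."""
    levels = 2 ** bits
    return 2.0 * y.astype(np.float64) / (levels - 1) - 1.0
```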

In terms of loss with Beta/Gaussian, the best average loss I've gotten is around -6.6. However, the model still exhibits some weird sharp noises in the speech, as you can hear from my Beta-distribution samples. I'm hoping this is just an issue of training hyper-parameters, which take a lot of testing to get right.

However, it could also be that the model is too simple to get working with the more difficult outputs like Beta/Gaussian. The only open-source models I've seen Gaussian output working on are WaveNet-variant models, which have a lot more parameters and fancier tricks than this model.

hdmjdp commented 5 years ago

@G-Wang Is the generated audio from the mu-law loss bad?

G-Wang commented 5 years ago

@hdmjdp Yes, not as good as raw bits.

Currently, ranked from best to worst in both audio quality and training speed:

raw bits (9 or 10 bits) > mu-law > Gaussian/Beta > mixture of logistics (this last one doesn't really work).

hdmjdp commented 5 years ago

@G-Wang Yes, my experiments match yours, but MoL can also generate wavs.

hdmjdp commented 5 years ago

@G-Wang Did you test the FC and RNN layers without the upsampling and resnet modules? Can this model run in real time on CPU?

G-Wang commented 5 years ago

@hdmjdp I haven't tried changing the architecture to see whether the network works without upsampling, but that would be an option to investigate.

As for real-time synthesis (on GPU only), I have implemented partially batched synthesis here: https://github.com/G-Wang/WaveRNN-Pytorch/blob/aa46ff3c4c199c152d50f047346f38f059b47b0c/model.py#L226

You can take the mel spectrogram for a single sentence (say dimension 80 x 300), split it up, and batch it (the batch size depends on your GPU; say a batch size of 10, giving dimension 10 x 80 x 30), pass this batched mel (a numpy array) through the batch_generate function to generate a batch of wavs, and then join the batch of wavs together into a single wav. This provides a speed-up: in my tests I used batch sizes of 4 and 6 and got close to real-time synthesis (13 seconds for 9 seconds of audio). With a higher batch size, real time may be possible. A rough sketch follows.
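Here is that recipe as a sketch; the chunking details and the exact batch_generate call are my assumptions, not the repo's API:

```python
import numpy as np

def synth_batched(model, mel, batch_size=4):
    """Chunk a full-utterance mel (n_mels x frames), run the chunks through
    batch_generate in parallel, and concatenate the waveform segments."""
    n_mels, frames = mel.shape
    chunk = frames // batch_size
    mel = mel[:, :chunk * batch_size]                    # drop trailing frames
    batch = np.stack(np.split(mel, batch_size, axis=1))  # (batch, n_mels, chunk)
    wav_segments = model.batch_generate(batch)
    return np.concatenate(wav_segments)
```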

I'm currently wrapping up exams, so I won't have time to implement this fully until December, but feel free to give it a crack. I think the batch-synthesis audio will sound fine, because our model is only trained on a very small sample length (hop_size x seq = 256 x 5).

I'm also looking into adding mel overlaps when batching, to avoid audio artifacts in the wav generated at the edges of the mel chunks.
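On the waveform side, that overlap idea could look roughly like this (a sketch, not code from the repo):

```python
import numpy as np

def crossfade_join(segments, overlap):
    """Join waveform segments generated from overlapping mel chunks,
    linearly fading across each overlap-sample boundary."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = segments[0].astype(np.float64)
    for seg in segments[1:]:
        out[-overlap:] = out[-overlap:] * (1.0 - fade_in) + seg[:overlap] * fade_in
        out = np.concatenate([out, seg[overlap:]])
    return out
```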

hdmjdp commented 5 years ago

@G-Wang Thanks very much, I will try it.

hdmjdp commented 5 years ago

@G-Wang I used batch_generate but got noise. What could be wrong?

G-Wang commented 5 years ago

@hdmjdp Did you ensure the mel spectrogram is normalized between 0 and 1? See audio.py for how the mel spectrogram is normalized. If you can, try using the same mel spectrograms the model was trained on.
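A quick sanity check along those lines (the helper name is mine; see audio.py for the actual normalization):

```python
def check_mel_range(mel):
    """The model is trained on mels normalized to [0, 1]; out-of-range
    inputs usually come out as noise."""
    lo, hi = float(mel.min()), float(mel.max())
    if lo < 0.0 or hi > 1.0:
        raise ValueError(f"mel range [{lo:.3f}, {hi:.3f}] is outside [0, 1]")
```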

I will take a look on my end to see where the bug is.

hdmjdp commented 5 years ago

@G-Wang My code had a bug; I've fixed it. I want to implement the inference code in C. The RNN and FC computation is 17 GFLOPs, so I am not sure whether it can run in real time on CPU.

G-Wang commented 5 years ago

You probably have to prune the RNN weights like they do in the WaveRNN paper to get a model small enough.
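The core of magnitude pruning is small; here is a much simplified sketch of what the paper describes (the real schedule ramps sparsity up over training and prunes in fixed-size blocks so sparse kernels stay fast):

```python
import torch

def magnitude_prune(weight, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of a weight tensor."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)
```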

erogol commented 5 years ago

@G-Wang I've tried Gaussian output in my branch https://github.com/erogol/WaveRNN and here is a result on LJSpeech: https://soundcloud.com/user-565970875/gaussian-wavernn

The intonation is different from what it should be, and there is static noise in the silent parts. Do you have any ideas for getting past this?

Now I plan to go with a mixture of logistics, assuming the distribution is too complex to approximate with a single Gaussian.

My other guess is that the variance bound is too large, since with a single Gaussian it causes sample values all over the place.

The Gaussian branch is here: https://github.com/erogol/WaveRNN/tree/gaussian

As for mu-law, I was not able to train it at all; the model did not converge. Basically, I read a raw wav file, apply mu-law compression, and train the network the same as before, then apply inverse mu-law after the whole audio is generated. Is there anything I am missing here? https://github.com/erogol/WaveRNN/tree/mulaw

Would be great to hear your experience.

G-Wang commented 5 years ago

@erogol The off intonation is rather strange; it's the first time I've seen that. Perhaps you haven't trained it enough? I recall that for my Gaussians I trained for a very large number of steps, something like 500k to 1M.

My experience with getting a single Gaussian to work is that the variance tends to grow large, and quite a few samples land way outside the range of -1 to 1 and have to be clipped. I tried fixing the variance as a constant and learning only the Gaussian mean, but the result was still not great.
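Concretely, the clipping I mean looks roughly like this; a sketch, since the exact guards in my runs may have differed:

```python
import torch
from torch.distributions import Normal

def gaussian_sample(mean, log_std, min_log_std=-7.0):
    """Sample from the predicted Gaussian and clip to the waveform range.

    Clamping log_std from below is one common guard against collapse;
    the clamp on the sample handles the unbounded support.
    """
    std = log_std.clamp(min=min_log_std).exp()
    return Normal(mean, std).sample().clamp(-1.0, 1.0)
```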

I ended up going with the Beta distribution, since it has bounded support on [0, 1] and I can simply rescale that to [-1, 1]. You can hear the samples [here](https://soundcloud.com/gary-wang-23/sets/wavernn-samples). They still contain the static artifacts you mentioned; I haven't figured out why.

As for mu-law, I tried many variations (including the one you mentioned, where mu-law is applied during audio preprocessing). However, for some reason the mu-law performance is not great. I still need to get around to investigating this; I suspect it has something to do with how the discrete mu-law output is converted back to real values before being fed to the RNN.

At the end of the day, I've found that 10-bit audio in the original implementation style gives me the best samples. You can hear some for yourself [here](https://soundcloud.com/gary-wang-23/sets/obama_bernie_fun). The model also trains fastest this way.

li-xx-5 commented 4 years ago

@G-Wang What batch size is appropriate for real-time synthesis? And if the batch size is large, will the voice quality decrease?

G-Wang commented 4 years ago

@doctor-xiang For real-time synthesis on GPU, you can refer to Fatcord's original WaveRNN repo; I believe he does batch synthesis with cross-fade blending, which can get close to real time on GPU.

If you want real time on CPU, refer to this fork of my repo, which has real-time CPU inference: https://github.com/geneing/WaveRNN-Pytorch

I would also suggest trying non-autoregressive models like MelGAN and Parallel WaveGAN, which achieve real time much more easily.

li-xx-5 commented 4 years ago

Thank you, understood.


WelkinYang commented 4 years ago

Hey, how are you doing?
I've recently tried a Gaussian distribution with multi-band WaveRNN. I find that my method doesn't work at all: I can't get anything but noise, while a softmax distribution gets good results. I also noticed the variance problem. When I tried clipping the variance, I found that this is effectively equivalent to fitting the sample distribution with an MSE loss, because the variance becomes very small.

Do you think this is a problem with how I'm handling the Gaussian distribution, or is a Gaussian really not suitable for modeling data that is actually discrete (for example, 16-bit audio: 65,536 discrete values in [-32768, 32767])? I haven't tried the Beta distribution yet. Do you have any suggestions or concluding comments? If I can get your reply, I will be very happy. I wish you a happy life.

G-Wang commented 4 years ago

@WelkinYang Sorry for the late reply. You can see in my code's distributions.py that I've tried a few distributions: Gaussian, Beta, and mixture of logistics. The nice thing about Beta over Gaussian is that the Beta distribution has bounded support (the range over which samples can be drawn) from 0 to 1, which you can then rescale to whatever range you'd like. With a Gaussian, you have to clip the samples to be within range, since it technically has infinite support. Note that I don't clip the variance, only the samples.

In terms of actual ease of training, I've found learning good continuous distributions to be more difficult than discrete ones. I think mixture of logistics is currently the best for WaveRNN.