dhgrs / chainer-VQ-VAE

A Chainer implementation of VQ-VAE.

It seems the result was full of noise, maybe the variance was too large? #1

Closed HudsonHuang closed 5 years ago

HudsonHuang commented 6 years ago

Hi, that's nice work! But the result seems to show little improvement after training for 13x as long. How did the loss behave during training, and how did the variance of the output change?

dhgrs commented 6 years ago

How was the loss in training

I'm still training and the loss is still decreasing.

loss1 (cross entropy of WaveNet):

loss2 (straight-through):

how was the variance of the output changing

Sorry, generation takes a very long time, so I have only tried it a few times. If you have GPUs, please try training with different parameters in opt.py. n_channels is probably the most effective one, but increasing it uses a lot of memory.
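For reference, loss2 above corresponds to the codebook/commitment terms of the VQ-VAE objective. Here is a minimal NumPy sketch of the vector-quantization step and those losses; the function and variable names are illustrative, not the actual code in this repo:

```python
import numpy as np

def vq_straight_through(z_e, codebook, beta=0.25):
    """Nearest-neighbour vector quantization with the VQ-VAE auxiliary losses.

    z_e:      (n, d) encoder outputs
    codebook: (k, d) embedding vectors
    Returns the quantized vectors, codebook indices, and the auxiliary loss.
    """
    # Squared distance from every encoder output to every codebook entry.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)   # nearest codebook entry per input
    z_q = codebook[idx]          # quantized outputs

    # Codebook loss (pulls embeddings toward z_e) plus beta times the
    # commitment loss (pulls z_e toward the embeddings). Numerically the
    # two terms are equal here; in a framework they differ only in which
    # side is wrapped in stop_gradient.
    codebook_loss = ((z_q - z_e) ** 2).mean()
    commitment_loss = beta * ((z_e - z_q) ** 2).mean()
    return z_q, idx, codebook_loss + commitment_loss

# The straight-through part: gradients flow through z_q as if it were z_e,
# i.e. z_st = z_e + stop_gradient(z_q - z_e) in an autodiff framework.
```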

HudsonHuang commented 6 years ago

Thank you for your reply, Lets's try together.

dhgrs commented 6 years ago

I noticed a tip.

I could get clearer results, but they lost the phonemes. Maybe the VQ codebook is too small to represent phonemes. I think larger d (and k) in opt.py may give better results.
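To illustrate why a small codebook can lose phonemes: with too few codes, encoder outputs for different phonemes can snap to the same embedding. A toy NumPy sketch (the vectors are made up; d and k just mirror the parameter names in opt.py):

```python
import numpy as np

def nearest_code(z, codebook):
    """Index of the codebook row closest to z (squared Euclidean distance)."""
    return int(((codebook - z) ** 2).sum(axis=1).argmin())

# Hypothetical encoder outputs for three phonemes (d = 2 for illustration).
phonemes = {"a": np.array([0.0, 0.0]),
            "i": np.array([0.2, 0.1]),
            "u": np.array([1.0, 1.0])}

small_cb = np.array([[0.1, 0.0], [1.0, 1.0]])              # k = 2
codes_small = {p: nearest_code(z, small_cb) for p, z in phonemes.items()}
# "a" and "i" collapse onto the same code, so the decoder can no longer
# tell them apart -- the phoneme distinction is lost.

larger_cb = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0]])  # k = 3
codes_large = {p: nearest_code(z, larger_cb) for p, z in phonemes.items()}
# With a larger codebook, each phoneme keeps its own code.
```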

agangzz commented 6 years ago

Good job on VQ-VAE. But the output seems full of noise, so I have some questions: how many hours of audio are in your training dataset? How long did training take (one GPU, or several)? Thanks.

dhgrs commented 6 years ago

I use the VCTK Corpus, which has 109 speakers. Each speaker reads about 400 sentences, so the total length is about 24 h (assuming each sentence is about 2 s). The training time is written in the README.

I use a model trained for 40,000 iterations. Training takes about 26.5 h on a single 1080 Ti.
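The figures above are easy to sanity-check (the 2 s/sentence value is the rough estimate from the comment, not a measured number):

```python
# Dataset length: 109 speakers x ~400 sentences x ~2 s per sentence.
total_seconds = 109 * 400 * 2
hours = total_seconds / 3600          # ~24.2 h, matching "about 24h" above

# Training throughput: 40,000 iterations in ~26.5 h on one 1080 Ti.
sec_per_iter = 26.5 * 3600 / 40000    # ~2.4 s per iteration
```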

dhgrs commented 6 years ago

I'm training VQ-VAE on CMU ARCTIC with a7244c5. Here are some samples at 160k iterations.

input: http://nana-music.com/sounds/037eb33f/

reconstructed: http://nana-music.com/sounds/037eb39a/

voice conversion: http://nana-music.com/sounds/037eb451/

After 200k iterations, I'll upload samples to the README too!

dhgrs commented 6 years ago

I trained for 200k iterations on CMU ARCTIC, but then it lost the phonemes. I think CMU ARCTIC is too small to train VQ-VAE. I also trained for 200k iterations on the VCTK Corpus; although it still had some noise, it didn't lose the phonemes.

If you want clearer sounds, you should use a larger network than the default in opt.py, or a larger dataset than CMU ARCTIC.