dhgrs / chainer-VQ-VAE

A Chainer implementation of VQ-VAE.

dilated conv before or after vq? #8

Closed barby1138 closed 5 years ago

barby1138 commented 5 years ago

Hi, I'm trying to understand why this block is applied after the VQ. What would happen if it were applied before? My aim is to get a representation of higher-level features.

    local_condition = F.relu(self.local_embed1(local_condition))
    local_condition = F.relu(self.local_embed2(local_condition))
    local_condition = F.relu(self.local_embed3(local_condition))
    local_condition = F.relu(self.local_embed4(local_condition))
    local_condition = F.relu(self.local_embed5(local_condition))
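For context, the VQ step itself is just a nearest-neighbor lookup into a learned codebook; the question is whether the conv stack above runs on the encoder output before that lookup or on the quantized codes after it. A minimal numpy sketch of the lookup (variable names are my own, not the repo's):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z: (T, D) encoder outputs; codebook: (K, D) learned embeddings.
    Returns the quantized vectors and the chosen indices.
    """
    # Squared distance between every frame and every codebook vector.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)  # (T,) nearest code per frame
    return codebook[idx], idx

# Toy example: 2-D codes, 3 codebook entries.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2]])
zq, idx = vector_quantize(z, codebook)
```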
dhgrs commented 5 years ago

Hi,

Dr. Zen, who is one of the WaveNet authors, tweeted in Japanese: https://twitter.com/heiga_zen/status/997194535087165441

In English: We apply 5 layers of non-causal dilated convolutions for local conditioning. It is like an encoder-decoder model; the encoder is the non-causal dilated convolutions and the decoder is WaveNet.

My understanding is that the non-causal dilated convolutions (self.local_embeds) turn the VQed high-level (abstract) features back into lower-level ones for the decoder.
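The "non-causal" part is the key difference from WaveNet's own convolutions: each output frame can mix information from both past and future frames. A minimal numpy sketch of one such layer (kernel size 3; this is my own illustrative helper, not the repo's code):

```python
import numpy as np

def noncausal_dilated_conv1d(x, w, dilation):
    """One non-causal dilated 1-D convolution over a signal x.

    x: (T,) input; w: (3,) kernel. Each output frame mixes the frames
    at t - dilation, t, and t + dilation (zero-padded at the edges),
    so stacking layers widens the context symmetrically in both
    directions -- unlike WaveNet's causal convs, which only look back.
    """
    xp = np.pad(x, dilation)
    return np.array([
        w[0] * xp[t] + w[1] * xp[t + dilation] + w[2] * xp[t + 2 * dilation]
        for t in range(len(x))
    ])

x = np.arange(5, dtype=float)  # [0, 1, 2, 3, 4]
# Kernel [1, 0, 1] just sums the frames 2 steps before and after t.
y = noncausal_dilated_conv1d(x, np.array([1.0, 0.0, 1.0]), dilation=2)
```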

barby1138 commented 5 years ago

Hmm, interesting, but from my observations: I've checked your implementation of VQ-VAE and it can transfer voice style between voices from the dataset (same-voice transfer works better), but if the voice is not from the dataset, the quality is much lower. From this I can assume that its VQ-VAE representation is not abstract (high-level) enough. Also check this paper, https://arxiv.org/pdf/1807.11470.pdf, page 11; it seems they use some dilated conv layers before the VQ. Any comments / ideas?

barby1138 commented 5 years ago

Actually, to my understanding, ByteNet is not just a sequential stack of dilated convs...

dhgrs commented 5 years ago

Yes, my model may be under-fitting, so it cannot learn a good-quality representation.

Many deep learning papers are not reproducible, so I'm interested in whether the paper is written in enough detail to reproduce its results. So in this repository, I want to reproduce WHAT IS WRITTEN IN THE PAPER, not the demo page's quality. From this point of view, I use a 4-layer convolution as the encoder.

If you want to reproduce the demo page's quality, I don't have many ideas... But I think the encoder architecture is a good place to look. A larger encoder (before or after the VQ) may help training. And if the encoder is trained well enough, the WaveNet can be smaller. In the ClariNet paper, a 20-layer WaveNet with smaller channels is used.
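On encoder size: the context window grows quickly with stacked dilated convs, which is why a few extra layers can make the representation much more abstract. A back-of-the-envelope helper, assuming a kernel size of 3 and a dilation schedule that doubles per layer (the common WaveNet-style pattern, not necessarily this repo's exact config):

```python
def receptive_field(n_layers, kernel_size=3, base_dilation=1):
    """Receptive field (in frames) of stacked dilated convolutions
    whose dilation doubles each layer: 1, 2, 4, ...

    Each layer widens the field by (kernel_size - 1) * dilation frames.
    """
    rf = 1
    for i in range(n_layers):
        dilation = base_dilation * 2 ** i
        rf += (kernel_size - 1) * dilation
    return rf

# With 5 layers (dilations 1, 2, 4, 8, 16) each frame sees 63 frames
# of context; with 4 layers it sees only 31.
five_layer_rf = receptive_field(5)
four_layer_rf = receptive_field(4)
```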

barby1138 commented 5 years ago

Thank you very much for the help and great job. Have a nice time :)