bshall / VectorQuantizedVAE

A PyTorch implementation of "Continuous Relaxation Training of Discrete Latent Variable Image Models"
MIT License

Codebook perplexity #3

Closed pclucas14 closed 4 years ago

pclucas14 commented 4 years ago

Hi Ben,

hope all is well on your site of the globe. I have a general VQVAE question for you. I'm training VQVAE on larger images (mini-imagenet), and when I'm monitoring the perplexity, it never goes above 40-50, even though I have a codebook of size 512. In general, do you ever get a perplexity that's close to your codebook size ?

Thanks! -Lucas

bshall commented 4 years ago

Hi Lucas,

Yeah, not too bad on my side. Hope you're keeping safe as well.

Perplexity isn't a great measure of codebook usage because it depends on batch size. So if you have a relatively small batch size you might see low perplexities even though actual codebook usage is higher. To get a better assessment I usually also keep track of the used code indices over an epoch. I can't remember the perplexities for this repo but with the speech model here, I got a final perplexity of about 240 with a codebook size of 512 and a batch size of 52, yet all 512 codes were actually used.
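For reference, a minimal sketch (not the exact code from this repo) of what I mean, assuming `indices` is the tensor of nearest-code indices the quantizer produces for one batch:

```python
import torch

codebook_size = 512
used_codes = set()  # reset at the start of each epoch

def batch_perplexity(indices: torch.Tensor, codebook_size: int) -> torch.Tensor:
    """Perplexity of the code assignments within a single batch."""
    one_hot = torch.nn.functional.one_hot(indices.flatten(), codebook_size).float()
    avg_probs = one_hot.mean(dim=0)
    return torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))

def update_usage(indices: torch.Tensor) -> int:
    """Accumulate the distinct codes seen so far this epoch."""
    used_codes.update(indices.flatten().tolist())
    return len(used_codes)  # independent of batch size
```

The per-batch perplexity is capped by how many encoder outputs fit in a batch, while the epoch-level count of distinct codes isn't, which is why the two can tell very different stories.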

Having said that, for some models/datasets it can be difficult to get good codebook usage. If you are in fact getting bad utilization, I can suggest a few tricks to try. Just let me know.

Hope that helps :)

pclucas14 commented 4 years ago

That's interesting. Yes, I'd be curious to know what tricks you have to increase codebook usage.

bshall commented 4 years ago

Sure. First, I found that batch normalization helps a lot. My guess is that batchnorm centers the outputs of the encoder at the origin and keeps their L2 norms relatively constant. This is important because at initialization the codebook is also centered at the origin. Next, at initialization, the magnitudes of the codes should be much smaller than the magnitudes of the outputs of the encoder. Lastly, I found that EMA updates improve codebook usage as well.
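Roughly, the pieces fit together like this sketch (my own illustration rather than code from this repo; the 0.01 init scale, 0.99 decay, and 64-dim codes are made-up values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMACodebook(nn.Module):
    def __init__(self, codebook_size=512, dim=64, decay=0.99, eps=1e-5):
        super().__init__()
        self.decay, self.eps = decay, eps
        # init codes with a much smaller magnitude than the encoder outputs
        embedding = torch.randn(codebook_size, dim) * 0.01
        self.register_buffer("embedding", embedding)
        self.register_buffer("cluster_size", torch.zeros(codebook_size))
        self.register_buffer("ema_embed", embedding.clone())

    def forward(self, z):                       # z: (N, dim) encoder outputs
        dists = torch.cdist(z, self.embedding)  # (N, K) distances to codes
        indices = dists.argmin(dim=1)
        one_hot = F.one_hot(indices, self.embedding.size(0)).type_as(z)
        quantized = self.embedding[indices]

        if self.training:  # EMA updates instead of a codebook loss term
            self.cluster_size.mul_(self.decay).add_(one_hot.sum(0), alpha=1 - self.decay)
            self.ema_embed.mul_(self.decay).add_(one_hot.t() @ z, alpha=1 - self.decay)
            n = self.cluster_size.sum()
            smoothed = (self.cluster_size + self.eps) / (n + self.embedding.size(0) * self.eps) * n
            self.embedding.copy_(self.ema_embed / smoothed.unsqueeze(1))

        # straight-through estimator so gradients flow back to the encoder
        quantized = z + (quantized - z).detach()
        return quantized, indices

# BatchNorm on the encoder output keeps it centred at the origin,
# which matches where the (small) codebook starts out:
encoder_output_norm = nn.BatchNorm1d(64, affine=False)
```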

bshall commented 4 years ago

Have you seen this paper (we're both in the acknowledgments by the way :joy:)? They present some of the same advice and have a good explanation as to why initializing the codebook to small values helps.

They also propose increasing the learning rate for the codebook and periodic codebook re-initialization. Haven't tried those yet but they sound promising.
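If you want to try them, something along these lines is how I picture it (a hedged sketch of my reading of the paper; the 10x learning-rate factor and the toy `encoder`/`codebook` modules are placeholders):

```python
import torch
import torch.nn as nn

# toy stand-ins so the sketch is self-contained; swap in your real modules
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1))
codebook = nn.Embedding(512, 64)

# 1) larger learning rate for the codebook parameters
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 3e-4},
    {"params": codebook.parameters(), "lr": 3e-3},  # ~10x higher, value is a guess
])

# 2) periodic re-initialization of "dead" codes to random encoder outputs
@torch.no_grad()
def reinit_dead_codes(codebook: nn.Embedding, usage_counts: torch.Tensor, z: torch.Tensor):
    """Replace codes that were never selected this epoch with random rows of z (N, dim)."""
    dead = (usage_counts == 0).nonzero(as_tuple=True)[0]
    if dead.numel() > 0:
        picks = torch.randint(0, z.size(0), (dead.numel(),))
        codebook.weight.data[dead] = z[picks]
```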

Let me know how your experiments go. I'm really interested in what works well across different datasets and domains.

pclucas14 commented 4 years ago

Wow, that's so cool, haha. Sounds good, I'll keep you posted!

pclucas14 commented 4 years ago

So I tried batch norm, and surprisingly it didn't help in the setting I'm working in (although it may just be that the architecture I've been iterating on is tuned to work best without batch norm). I'll let you know if I find a magic solution.

pclucas14 commented 4 years ago

Actually, this was most likely due to me using a small batch size.

bshall commented 4 years ago

Thanks for the update @pclucas14. Good to know that it was the batch size.