
VAE on a huge dataset #6

Open anne1994 opened 5 years ago

anne1994 commented 5 years ago

I read your post on VAEs. I am using the PyTorch ResNet50 architecture as my encoder and a simple 5-layer stack of transposed convolutions as the decoder to get back to the input dimensions. The reconstructed images are noisy, and there is not much change in the loss. I see that you have a VAE-ResNet implementation; will that work for a huge dataset with millions of images? I also read another post of yours about the PCA trick. How do the VAE-ResNet results differ from those of the PCA trick implementation? I will also try a ResNet50 pretrained on ImageNet to see what the issue is. Basically, I am unable to rule out whether the problem is with the encoder, the decoder, or the dataset size. Any suggestion would help. Thanks.
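For reference, here is a minimal sketch of the kind of decoder described above: a latent vector projected to a 7x7 feature map and upsampled to 224x224 with five transposed convolutions. All names and sizes are assumptions for illustration, not my actual code.

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Illustrative 5-layer transposed-convolution decoder (assumed shapes)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        # Project the latent vector to a 512-channel 7x7 feature map
        self.fc = nn.Linear(latent_dim, 512 * 7 * 7)
        self.deconv = nn.Sequential(
            # Each layer doubles the spatial size: 7 -> 14 -> 28 -> 56 -> 112 -> 224
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 512, 7, 7)
        return self.deconv(x)
```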

bjlkeng commented 5 years ago

Hi @anne1994 , I actually don't know if it will work with millions of images; theoretically, there's no reason it can't. I would say my experience with vanilla VAEs is that they aren't great at reconstructing images. There are dozens of papers on the subject, but it probably has something to do with the latent variables and the loss function. I do recall reading about some extensions that can do better, but I haven't seen vanilla VAEs reconstruct non-blurry images.
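For concreteness, here's a minimal sketch of the standard vanilla VAE objective (the negative ELBO) in PyTorch, assuming a Gaussian posterior and a Bernoulli decoder. The pixel-wise reconstruction term is one common explanation for the blurriness, since it effectively averages over plausible outputs. This is illustrative, not the code from my repo:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction term: per-pixel binary cross-entropy, summed per batch
    recon = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # KL divergence between q(z|x) = N(mu, sigma^2) and the N(0, I) prior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```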

The PCA trick seemed to work well enough on the SVHN dataset. The main idea is that you fit the VAE on PCA-transformed data; for some reason, it's the only way I could get semi-reasonable results on SVHN, although my intuition is that this is probably very dataset dependent. Also, I'm not sure a pre-trained ResNet would help on this problem, since we're not doing image classification.
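A rough sketch of the idea, where `images` is an (N, H, W, C) array and `train_vae` / `vae.sample` are hypothetical placeholders for whatever VAE training and sampling code you already have:

```python
from sklearn.decomposition import PCA

N, H, W, C = images.shape
X = images.reshape(N, -1)                    # flatten to (N, H*W*C)
pca = PCA(n_components=500, whiten=True)     # 500 components is an arbitrary choice
X_pca = pca.fit_transform(X)

vae = train_vae(X_pca)                       # fit the VAE in PCA space, not pixel space

# Map VAE samples (or reconstructions) back to pixel space
samples = pca.inverse_transform(vae.sample(n=16)).reshape(-1, H, W, C)
```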

My main question would be what you're trying to achieve. If it's to generate realistic images, GANs are by far the state of the art in this domain. If you just want to learn more about VAEs, then you really don't need a 1M-image dataset; smaller ones will do just fine. My posts are more in the latter bucket; I'm very interested in probabilistic generative models.

Hope that helps.

anne1994 commented 5 years ago

My goal is to use the embeddings generated by the encoder to find similar items. When I check the vanilla VAE (FC layers), the t-SNE looks fine, but with conv layers I don't get similar items from a nearest-neighbor search. I wonder why? If we have two or three inputs which are very similar, does the encoder represent them close to each other in latent space?

bjlkeng commented 5 years ago

Theoretically, they should. However, if you just want to extract features from images, I would suggest using a network pre-trained on ImageNet and taking features from an intermediate layer. This is probably not state of the art in representation learning, but it works pretty well for images.
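A sketch of that suggestion using torchvision: take the pooled 2048-d features just before the classifier of a pretrained ResNet50 and do nearest-neighbor search on them. The `images` tensor is assumed to be preprocessed/normalized the way the pretrained model expects:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

resnet = models.resnet50(pretrained=True)
resnet.fc = nn.Identity()        # drop the classification head -> 2048-d features
resnet.eval()

with torch.no_grad():
    emb = resnet(images)         # (N, 2048) embeddings

# Cosine-similarity nearest neighbors for the first image as the query
emb = F.normalize(emb, dim=1)
sims = emb @ emb[0]
nearest = sims.topk(6).indices   # rank 0 is the query itself
```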

The other point to keep in mind is that vanilla VAEs don't do that well on complex image datasets. For example, they do pretty poorly on CIFAR10, and I imagine they would do similarly on ImageNet or some other million-image dataset. Thus, I would expect the representations they generate to be quite poor, which might explain what you're seeing. There are probably some extensions of VAEs that would be good for this, but I don't have any direct experience with them.

Hope that helps!