casperkaae / parmesan

Variational and semi-supervised neural network toppings for Lasagne

Reproduce results from sec. 6.1 in "Variational inference using normalizing flows" #22

Open casperkaae opened 8 years ago

casperkaae commented 8 years ago

As discussed in #21 it would be nice to reproduce the results from sec. 6.1 in the "Variational inference using normalizing flows" paper by Rezende et al.

I would guess the approach is:

wuaalb commented 8 years ago

I was thinking something similar

Is there any reason to use squared error instead?

casperkaae commented 8 years ago

I agree with using KL instead of squared error. Either we can just set q0 to a standard normal, or we could fit the mean/sigma to U before optimizing the flow parameters?

I think q_true(z) = e^{-U(z)}/Z should be p_true(z) = e^{-U(z)}/Z, and the loss is: KL[phat_true(z)||qK(z)] = E_{zK~qK(z)}[log phat_true(zK) - log qK(zK)] = E_{z0~q0(z)}[-U(zK) - log(Z) - log q0(z0) + sum_k logdet J_k], which, up to the constant log(Z), is E_{z0~q0(z)}[-U(zK) - log q0(z0) + sum_k logdet J_k]
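For reference, the move from an expectation over zK to an expectation over z0 relies on the change-of-variables identity for a flow zK = f_K(...f_1(z0)). A sketch of that step in LaTeX, assuming p(z) = e^{-U(z)}/Z and writing J_k for the Jacobian of the k-th transform (the notation is chosen here to match the comment above):

```latex
\log q_K(z_K) = \log q_0(z_0) - \sum_{k=1}^{K} \log\left|\det J_k\right|

\mathbb{E}_{z_K \sim q_K}\left[\log p(z_K) - \log q_K(z_K)\right]
  = \mathbb{E}_{z_0 \sim q_0}\left[-U(z_K) - \log Z - \log q_0(z_0)
      + \sum_{k=1}^{K} \log\left|\det J_k\right|\right]
```

Written this way, the expectation equals -KL[q_K || p], so the quantity to minimize is its negative.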

wuaalb commented 8 years ago

Yes, I think that makes sense. Maybe a good way would be to have eps ~ N(0, I) and then z0 = mu + exp(0.5*log_var)*eps and jointly optimize {mu, log_var} together with the NF parameters using the same KL-divergence loss as above.
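A minimal NumPy sketch of that reparameterized base distribution, for illustration only. The function name `sample_z0` and the choice to return log q0(z0) alongside the samples are assumptions made here, not part of Parmesan:

```python
import numpy as np

def sample_z0(mu, log_var, n_samples, rng):
    """Draw z0 = mu + exp(0.5 * log_var) * eps with eps ~ N(0, I).

    Returns the samples and log q0(z0) under the diagonal Gaussian,
    which is the term needed by the KL loss.
    """
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    z0 = mu + np.exp(0.5 * log_var) * eps
    # (z0 - mu)^2 / sigma^2 == eps^2, so the log density only needs eps.
    log_q0 = -0.5 * np.sum(np.log(2.0 * np.pi) + log_var + eps ** 2, axis=1)
    return z0, log_q0
```

In a Theano or autograd setting, {mu, log_var} would be trainable and gradients would flow through z0 because eps is the only source of randomness.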

I didn't get the part about the sign changing, but I agree p(z) is a better name than q_true(z).

casperkaae commented 8 years ago

Are you up for testing it? Otherwise I'll try it out at some point, but that might not be until next week or something like that.

wuaalb commented 8 years ago

I think in the above equations the KL term was swapped and it should be KL[qK(z)||p(z)]..

However, even with this change I didn't have much luck. The initial distribution (standard normal) just gets compressed to one of the sides (which side depends a bit on optimizer, initialization, etc.).


I also tried a simpler case where p(z) is a diagonal covariance normal distribution with some manually set mean and variance, and I applied the flow f(z) = mu + sigma*z.. This does seem to work, except that I had some Theano-problems getting the transformed distribution out, and could only get the transformed samples out..


I wonder if maybe there is some issue with the NormalizingPlanarFlowLayer after all..

casperkaae commented 8 years ago

Ok, thanks for the effort :)! So either we can't figure out how they ran the experiments, or there is a bug in the NormalizingPlanarFlowLayer code.

Can you share the code you used to run the experiments with me?

wuaalb commented 8 years ago

The code is here

Maybe one interesting experiment would be to apply two planar transformations f(z) = z + u h(w^T z + b) and manually figure out some settings of {u, w, b} (there's a geometric interpretation of the transform) that "split" a spherical distribution into a bimodal distribution like in figure 1 of the paper. Then try learning that and seeing what goes wrong.
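For experimenting with hand-picked {u, w, b}, a standalone NumPy version of a single planar transform may be handy. This is a sketch that assumes h = tanh and uses the determinant identity log|det J| = log|1 + u^T psi(z)| with psi(z) = h'(w^T z + b) w. The name `planar_flow` is made up here and this is not the NormalizingPlanarFlowLayer code:

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w^T z + b) to a batch z of shape (n, d).

    Returns f(z) and log|det J| for each row of z.
    """
    a = z @ w + b                               # pre-activation, shape (n,)
    f = z + np.outer(np.tanh(a), u)             # shape (n, d)
    psi = (1.0 - np.tanh(a) ** 2)[:, None] * w  # h'(a) * w, shape (n, d)
    log_det = np.log(np.abs(1.0 + psi @ u))     # shape (n,)
    return f, log_det
```

Comparing `log_det` against a finite-difference Jacobian at a few points is a cheap way to check a flow layer independently of the loss and the optimizer.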

casperkaae commented 8 years ago

Thanks for the code. Here are a few comments: I think you need to subtract p(zK) = N(zK | 0,I) from the loss, i.e. eq. (20) in the paper?

If I do that with k=32 I get [image: figure_1], i.e. one of the modes is correct.

wuaalb commented 8 years ago

Sounds interesting.. but I'm not completely sure what you mean; could you post the exact change? What's "the loss jf"?

I ran some more experiments and fixed the 3rd plot (so the 3rd and 4th plots should be identical). With nflows=8 I get something similar to yours: [image: nf2, nflows 8, winit normal 0]

However, if I initialize with w=lasagne.init.Normal(mean=[1.0, 0.0]), it is able to split the distribution into a bimodal one: [image: nf2, nflows 8, winit 1 0]

So, I think this at least means that NormalizingPlanarFlowLayer outputs the correct f(z) and log det |J|. So if it has a problem, it is probably in how the constraints are applied or in the initialization of the parameters; or (perhaps more likely) there's something wrong with the loss or optimization.

casperkaae commented 8 years ago

"I think you need to subtract p(zK) = N(zK | 0,I) from the loss i.e. eq (20) in the paper": Yes that was a bit cryptic :). I just meant that log p(x,zK) = log p(zK) + log p(x|zK) = log p(zK) + U_z(zK). I think your code is missing log p(zK)?

Your second plot looks very much like the paper, that's great. Maybe they just didn't completely specify how they initialized the params then. If I add log p(zK), use k=8 and w=lasagne.init.Normal(mean=0.0, std=1.0), I get [image] and a loss of ≈2.15


wuaalb commented 8 years ago

Hmm.. but doesn't the equation you are referencing only make sense in the standard VAE setting, where the observed data x is generated by a random process involving unobserved latent variables z?

Here there is no observed data x.. We are just trying to get our approximate distribution q_K(z) to match the target distribution p(z) by minimizing the KL-divergence between the two..

The KL-divergence uses p(z_K) in place of p(x, z_K) in eq. 20. Keeping in mind that the energy function U(z) relates to probability like p(z) = 1/Z e^{-U(z)}, we get log p(z_K) = -U(z_K) - log Z (the normalization constant log Z term is omitted from the loss function as we do not know it and it doesn't affect the optimization).
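As a rough illustration of that loss, here is a Monte Carlo estimate written in NumPy. The function name and the quadratic energy used for the sanity check are hypothetical and chosen only because the answer is known in closed form; they are not one of the paper's test energies:

```python
import numpy as np

def free_energy_loss(log_q0_z0, sum_log_det, U_zK):
    """Estimate E_{z0 ~ q0}[log q0(z0) - sum_k log|det J_k| + U(zK)].

    This equals KL[q_K || p] - log Z, so it differs from the KL divergence
    only by a constant that does not depend on the flow parameters.
    """
    return np.mean(log_q0_z0 - sum_log_det + U_zK)

# Sanity check with an identity flow (zK = z0, zero log-determinants) and
# the hypothetical energy U(z) = 0.5 * ||z||^2 in 2D. Then p is a standard
# normal with Z = 2 * pi, the KL is zero, and the loss is -log(2 * pi).
rng = np.random.default_rng(0)
z0 = rng.standard_normal((1000, 2))
log_q0 = -0.5 * np.sum(np.log(2.0 * np.pi) + z0 ** 2, axis=1)
loss = free_energy_loss(log_q0, np.zeros(1000), 0.5 * np.sum(z0 ** 2, axis=1))
```

Because log Z is unknown for a general U(z), the loss value itself can be negative and only its decrease during training is meaningful.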

wuaalb commented 8 years ago

[image: nf2, nflows 32, winit uniform -1 1, uinit normal mean 0 std 1, rmsprop with momentum, annealed, 250 epochs]

Getting a little closer, but it seems quite fiddly..

casperkaae commented 8 years ago

Sorry for the late response.

It looks really good. If you make a short example I'll be very happy to include it in the examples section. I hope to get around to testing the norm-flow implementation more thoroughly on MNIST soon, but your results seem to indicate that it is working.

wuaalb commented 8 years ago

I wrote an email to Rezende about this and he kindly confirmed that this is in fact a pretty tricky optimization problem and that the parameters should be initialized by drawing from a normal distribution with small variance (ie. the transforms start out close to the identity map).

Unfortunately, I still haven't been able to find a single configuration that works well for all the problems, but for some it is OK (although not quite as good as the plots in the paper), e.g. [gfycat animation]

I'm a little short on time right now, but once I get a working example I don't mind contributing it to Parmesan.

If you try the MNIST example and you can afford to do multiple runs, I'd be interested in knowing whether initializing the b parameter to const(0) vs. drawing from normal(std=1e-2) makes a difference (also initializing {u,w,b} with normal(std=1e-2) vs. normal(std=1e-3)).

justinmaojones commented 8 years ago

Hello, I have been working on reproducing the work in this paper as well. I found that, in both the synthetic examples and on MNIST, increasing the variance of the distribution from which parameter initializations are drawn was very helpful. For example, try Uniform(-1.5, 1.5).

Annealing was also quite helpful for the synthetic cases, and I have found some evidence that it is also helpful for MNIST. I also found iterative training helpful for MNIST (i.e. successively add each flow layer throughout training), though it wasn't helpful for the synthetic examples.
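For readers unfamiliar with annealing in this setting, one way to apply it to the synthetic problems is to scale the energy term by an inverse temperature beta that grows to 1 during training. The sketch below is an illustration only: the linear schedule, its default values and the function names are assumptions, not settings confirmed in this thread:

```python
import numpy as np

def anneal_beta(step, beta_start=0.01, n_anneal_steps=10000):
    """Linearly increase beta from beta_start up to 1.0, then hold it.

    beta_start and n_anneal_steps are assumed defaults for illustration.
    """
    return min(1.0, beta_start + step / n_anneal_steps)

def annealed_loss(log_q0_z0, sum_log_det, U_zK, step):
    """Loss with the energy term scaled by beta, so that early training
    is dominated by the entropy of q_K rather than by U."""
    beta = anneal_beta(step)
    return np.mean(log_q0_z0 - sum_log_det + beta * U_zK)
```

With beta small, the flow is encouraged to stay spread out, which may help it avoid collapsing onto a single mode early on.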

I would be interested to hear how any of this works for you.

yberol commented 7 years ago

@wuaalb would it be possible for you to share the implementation that produced the above gfycat with me?

wuaalb commented 7 years ago

@yberol Yes, sure, I uploaded it here. I'm not 100% sure the current settings produce that gif. If not, the correct settings are probably somewhere in there, but commented out.

yberol commented 7 years ago

@wuaalb Thank you very much, really appreciate your help!

yberol commented 7 years ago

@wuaalb Thanks again for sharing your implementation, it guided me a lot as I re-implemented it using autograd. However, I have a question.

When I plot the histogram of the samples z_K, everything is as expected. You are also plotting q_K(z_K) on the uniform grid. To compute q_K(z_K) on the grid, wouldn't I need to follow the inverse flow and get z_0, z_1, ..., z_{K-1} that produced z_K? Right now, I compute z_K's when z_0 is the uniform grid, but then the z_K's are warped and the plot does not look very nice. I would appreciate it if you can clarify how you computed q_K(z_K) for me.

[images: 2beans, 2sines]

wuaalb commented 7 years ago

It's been like a year and a half, so I don't remember any of the details..

To me your plots look more or less OK, just with a white background and maybe some scaling issue (?).

I think in my code example this is the relevant code

This is to avoid the white background

cmap =

I think q_K(z_K) is computed like

log_qK_zK = log_q0_z0 - sum_logdet_J
qK_zK = T.exp(log_qK_zK)

weixsong commented 6 years ago


How many epochs do you need to train for each U(x) distribution?

In my experiments, after 15 epochs (each epoch 10000 steps), still nothing is sampled.

Regards, Wei

Chechgm commented 5 years ago

@justinmaojones, how do you train for MNIST if you don't know the target distribution? would you be so kind to share the (pseudo) code?

vr308 commented 4 years ago

Hi @yberol did you manage to figure out how to get around the warping issue of the transformed grid?
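One possible way around the warping, sketched here as a suggestion rather than as what the original code did: keep z_0 on the uniform grid, push it through the flow, and plot q_K at the warped locations z_K instead of at the grid points. No inverse flow is needed, because log q_K(z_K) = log q_0(z_0) - sum of log-determinants is evaluated along the forward pass. The helper names are hypothetical, and the affine flow f(z) = mu + sigma*z is used only because its density is known exactly:

```python
import numpy as np

def log_standard_normal(z):
    """log N(z | 0, I) for a batch z of shape (n, d)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi) + z ** 2, axis=-1)

def density_on_warped_grid(flow, lim=4.0, n=50):
    """Evaluate q_K at the images z_K of a uniform 2D grid of z_0 values.

    `flow` maps z0 of shape (m, 2) to (zK, sum_log_det) and is assumed
    to use a standard normal base distribution q0.
    """
    xs = np.linspace(-lim, lim, n)
    g0, g1 = np.meshgrid(xs, xs)
    z0 = np.stack([g0.ravel(), g1.ravel()], axis=1)
    zK, sum_log_det = flow(z0)
    log_qK = log_standard_normal(z0) - sum_log_det
    return zK.reshape(n, n, 2), np.exp(log_qK).reshape(n, n)

def affine_flow(z, mu=np.array([1.0, -0.5]), sigma=np.array([0.5, 2.0])):
    """Toy flow f(z) = mu + sigma * z with log|det J| = sum(log sigma)."""
    return mu + sigma * z, np.full(z.shape[0], np.sum(np.log(sigma)))

zK, qK = density_on_warped_grid(affine_flow, n=20)
```

The result can then be drawn at the warped coordinates, for example with `plt.scatter(zK[..., 0].ravel(), zK[..., 1].ravel(), c=qK.ravel())`, so the colors sit where the density was actually evaluated.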