JavierAntoran / Bayesian-Neural-Networks

Pytorch implementations of Bayes By Backprop, MC Dropout, SGLD, the Local Reparametrization Trick, KF-Laplace, SG-HMC and more
MIT License

about Gaussian prior #3

Closed ShellingFord221 closed 4 years ago

ShellingFord221 commented 4 years ago

Hi, in bbp_homoscedastic.ipynb it seems that you use a standard Gaussian prior rather than a scale-mixture prior. I think a mixture-of-Gaussians prior can better model the real distribution of the weights w. Thanks!

stratisMarkou commented 4 years ago

Thanks for the observation! It is true that tweaking your priors can lead to improved performance, and we'd like to have these experiments for regression too. Remember, however, that for a mixture prior the KL divergence integral becomes intractable, so you'll need to evaluate it by Monte Carlo, which will increase the variance of the MC estimate of the ELBO.
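For concreteness, here is a minimal sketch of what that MC estimate could look like in PyTorch. This is illustrative only, not the notebook's code; the function names, the default sigmas, and the assumption that mu and rho are vectors are all placeholders.

```python
import torch
from torch.distributions import Normal

def scale_mixture_log_prob(w, pi=0.5, sigma1=1.0, sigma2=0.002):
    # Log-density of a two-component Gaussian scale-mixture prior
    # (as in Blundell et al., 2015); pi, sigma1, sigma2 are illustrative values.
    log_comp1 = Normal(0.0, sigma1).log_prob(w) + torch.log(torch.tensor(pi))
    log_comp2 = Normal(0.0, sigma2).log_prob(w) + torch.log(torch.tensor(1.0 - pi))
    return torch.logsumexp(torch.stack([log_comp1, log_comp2]), dim=0)

def mc_kl(mu, rho, n_samples=5):
    # MC estimate of KL(q || p): average log q(w) - log p(w) over samples from q.
    # mu, rho are (D,) variational parameters; sigma = softplus(rho) stays positive.
    sigma = torch.log1p(torch.exp(rho))
    q = Normal(mu, sigma)
    w = q.rsample((n_samples,))  # reparameterised, so gradients flow back to mu and rho
    return (q.log_prob(w) - scale_mixture_log_prob(w)).sum(dim=1).mean()
```

The extra variance mentioned above comes from the fact that this KL term is itself a sample average, on top of the MC estimate of the expected log-likelihood.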

ShellingFord221 commented 4 years ago

Thank you very much! Now I have another question about the prior. In the paper Weight Uncertainty in Neural Networks, section 3.3, a simple two-component Gaussian scale mixture is proposed to make the prior more amenable to optimisation. But the posterior is still assumed to be a single Gaussian, so how can a single Gaussian approximate a mixture of Gaussians? From an implementation perspective we can still do it: we sample w from its mu and sigma, then compute the KL divergence between p and q. But from a mathematical perspective it is just meaningless! You can never use a single Gaussian to approximate a mixture of Gaussians, since they have different shapes and different expressions for the mean and variance. So I think making the prior a mixture is better for performance, but the posterior distribution should also be adjusted to match the form of the prior. Maybe I'm wrong, but I'd be very grateful if you could discuss this with me. Thanks!

ShellingFord221 commented 4 years ago

Besides, should pi in the following formula be a learnable parameter or a fixed value, like sigma1 and sigma2?

P(w) = \prod_j [ pi N(w_j | 0, sigma1^2) + (1 - pi) N(w_j | 0, sigma2^2) ]

(I think it should be a parameter, to make the prior more flexible during training, but I have seen other repos fix it at a value like 0.5.)

stratisMarkou commented 4 years ago

Answer: I think you are misunderstanding the role of the variational posterior. Its role is not to model the prior, but to approximate the true posterior. I also understand the frustration, but before rushing to declare something as meaningless, you should have a read through the literature.

Details: As explained in #4, the goal in inference for BNNs is to approximate the posterior predictive integral p(y | x, data) = \int p(y | x, w) p(w | data) dw and the model evidence p(data | architecture, prior-parameters). There are two difficulties with these:

  1. The integral is intractable. We need to approximate it somehow, which is why we use MC as explained in #4. There are other approaches to evaluating this integral, such as the Laplace approximation (also available in our repo).
  2. If you want to use MC, you need to sample from p(w | data), but this is also a hard problem in general. Therefore the BBP paper makes a further approximation falling under the family of methods called Variational Inference (see Bishop's book referenced in #4 and the paper you referenced). VI approximates p(w | data) with a variational posterior q(w | theta), and learns the theta parameters from the data so as to make q(w | theta) similar to p(w | data). The variational posterior approximates the true posterior, not the prior. Further, the variational posterior can be whatever you like, so we can choose something we can sample from, such as a Gaussian (see the sampling sketch below).
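Sampling from a Gaussian q(w | theta), with theta = (mu, rho), is just the reparameterisation trick. A minimal sketch (shapes and names are illustrative):

```python
import torch

# theta = (mu, rho); sigma = softplus(rho) keeps the standard deviation positive.
mu = torch.zeros(100, requires_grad=True)
rho = torch.full((100,), -3.0, requires_grad=True)

eps = torch.randn_like(mu)                        # eps ~ N(0, I), independent of theta
w = mu + torch.nn.functional.softplus(rho) * eps  # w ~ q(w | theta), differentiable in mu and rho
```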

Now, the measure of closeness of q(w | theta) to p(w | data) can also be whatever you like, but there are good reasons why it should be the KL divergence KL(q(w | theta) || p(w | data)): it takes values from 0 to +infinity, and 0 is attained only if q(w | theta) = p(w | data). The variational approach therefore maximises the ELBO = log p(data | architecture, prior-parameters, etc.) - KL(q(w | theta) || p(w | data)), which turns out to be equivalent to maximising \int log p(data | w) q(w | theta) dw - KL(q(w | theta) || p(w)). So the model tries to learn a q which models the data well but is also close to the prior in terms of KL. The Gaussian variational posterior does not try to approximate the scale-mixture prior; it is only incentivised to stay close to it by the KL term.

Your intuition is close to what is really happening, but you are missing the main point about VI. I would recommend reading Bishop's sections on VI and the Weight Uncertainty paper. With regards to your question on learning priors: yes, learning pi would make the prior more flexible; David MacKay's thesis explores this in detail.
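To tie the pieces together, here is a rough sketch of that minibatch objective for a single Bayesian linear regressor, with pi made learnable through a sigmoid purely to illustrate the remark about learning the prior (the BBP paper keeps pi fixed). The names, default sigmas, and noise level are illustrative, not the repo's implementation.

```python
import torch
from torch.distributions import Normal

def negative_elbo(x, y, mu, rho, pi_logit, sigma1=1.0, sigma2=0.002,
                  noise_sigma=0.1, n_samples=5, num_batches=1):
    # mu, rho: (D,) variational parameters; x: (N, D) inputs; y: (N,) targets.
    sigma = torch.log1p(torch.exp(rho))
    q = Normal(mu, sigma)
    w = q.rsample((n_samples,))                              # (S, D) reparameterised samples

    # MC estimate of E_q[log p(y | x, w)] over the S weight samples.
    preds = x @ w.t()                                        # (N, S)
    log_lik = Normal(preds, noise_sigma).log_prob(y.unsqueeze(1)).sum(0).mean()

    # MC estimate of KL(q || scale-mixture prior), with pi = sigmoid(pi_logit) learnable.
    pi = torch.sigmoid(pi_logit)
    log_prior = torch.logsumexp(torch.stack([
        Normal(0.0, sigma1).log_prob(w) + torch.log(pi),
        Normal(0.0, sigma2).log_prob(w) + torch.log(1.0 - pi)]), dim=0)
    kl = (q.log_prob(w) - log_prior).sum(1).mean()

    return kl / num_batches - log_lik                        # minimise this per minibatch
```

Passing mu, rho and pi_logit to an optimiser and minimising this per minibatch would then fit both the variational posterior and, in this sketch, the prior's mixing weight.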