Closed ShellingFord221 closed 4 years ago
Thanks for the observation! It is true that tweaking your priors can lead to improved performance, and we'd like to have these experiments for regression too. Remember, however, that for a mixture prior the KL divergence integral becomes intractable, so you'll need to evaluate it by Monte Carlo, which will increase the variance of the Monte Carlo estimate of the ELBO.
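As a minimal numpy sketch of that Monte Carlo KL term (the hyperparameter values and function names here are illustrative, not taken from this repo): draw samples from the Gaussian q, then average log q(w) - log p(w), where the scale-mixture log-density p is evaluated with logaddexp for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(w, mu, sigma):
    # log density of N(mu, sigma^2), elementwise
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((w - mu) / sigma) ** 2

def mc_kl(mu, sigma, pi=0.5, sigma1=1.0, sigma2=0.1, n=10000):
    # KL(q || p) ~= (1/S) * sum_s [log q(w_s) - log p(w_s)], with w_s ~ q = N(mu, sigma^2)
    w = mu + sigma * rng.standard_normal(n)
    log_q = log_gauss(w, mu, sigma)
    # scale-mixture prior: p(w) = pi * N(0, sigma1^2) + (1 - pi) * N(0, sigma2^2)
    log_p = np.logaddexp(np.log(pi) + log_gauss(w, 0.0, sigma1),
                         np.log(1.0 - pi) + log_gauss(w, 0.0, sigma2))
    return np.mean(log_q - log_p)
```

When the two mixture components coincide (sigma1 == sigma2), the prior collapses to a single Gaussian and the estimator recovers the closed-form KL, so that is a handy sanity check; the estimator's variance is what the comment above is warning about.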
Thank you very much! Now I have another question about the prior. In the paper Weight Uncertainty in Neural Network, section 3.3, a simplified mixture of two Gaussians is proposed to make the prior more amenable to optimisation. But the posterior is still assumed to be a single Gaussian, so how can a single Gaussian be used to approximate a mixture of Gaussians? From an implementation perspective we can still do it: we sample w from its mu and sigma, then calculate the KL divergence between p and q. But from a mathematical perspective it seems meaningless! You can never use a single Gaussian to approximate a mixture of Gaussians, since they have different shapes and different expressions for the mean and variance. So I think that making the prior a mixture is better for performance, but the posterior distribution should also be adjusted to match the form of the prior. Maybe I'm wrong, but I'd be very grateful if you could discuss this with me. Thanks!
Besides, should pi in the following formula be a parameter or a fixed value, like sigma1 and sigma2? (I think it should be a parameter, to make the prior more flexible during training, but I've seen other repos fix it at a value like 0.5.)
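If pi were made learnable, one common trick (sketched here as an assumption, not something this repo does) is to optimise an unconstrained logit and map it through a sigmoid, so pi stays in (0, 1) no matter what the optimiser does:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical learnable mixture weight: the optimiser updates rho_pi freely,
# and pi = sigmoid(rho_pi) is always a valid probability.
rho_pi = 0.0            # unconstrained parameter
pi = sigmoid(rho_pi)    # 0.5 at initialisation, matching the fixed value other repos use
```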
Answer: I think you are misunderstanding the role of the variational posterior. Its role is not to model the prior, but to approximate the true posterior. I understand the frustration, but before declaring something meaningless, it's worth having a read through the literature.
Details: As explained in #4, the goal in inference for BNNs is to approximate the posterior predictive integral p(y | x, data) = \int p(y | x, w) p(w | data) dw and the model evidence p(data | architecture, prior-parameters). There are two difficulties with these:
Now, the measure of closeness of q(w | theta) to p(w | data) can also be whatever you like, but there are good reasons why it should be the KL divergence KL(q(w | theta) || p(w | data)): it takes values in [0, +infinity), and 0 is attained if and only if q(w | theta) = p(w | data). The variational approach therefore maximises the ELBO = log p(data | architecture, prior-parameters etc.) - KL(q(w | theta) || p(w | data)), which turns out to be equivalent to maximising \int log p(data | w) q(w | theta) dw - KL(q(w | theta) || p(w)).

So the model tries to learn a q which models the data well but is also close to the prior in terms of KL. The Gaussian variational posterior does not try to approximate the scale mixture prior; it is only incentivised to be close to it by the KL term. Your intuition is close to what is really happening, but you are missing this main point about VI. I would recommend reading Bishop's sections on VI and the Weight Uncertainty paper. With regards to your question on learning priors: yes, learning pi would make the prior more flexible; David MacKay's thesis explores this in detail.
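To make the decomposition concrete, here is a tiny numpy sketch that estimates the ELBO as E_q[log p(data | w)] - KL(q || prior) from reparameterised samples of q. The one-weight linear model, the toy data, and the N(0, 1) prior are all invented for illustration; they are not from this repo.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy one-weight model: y ~ N(w * x, 1), with an assumed N(0, 1) prior on w.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 0.9, 2.1])

def log_gauss(v, mu, sigma):
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((v - mu) / sigma) ** 2

def elbo(mu, sigma, n=5000):
    # ELBO = E_q[log p(data | w)] - KL(q || prior), both estimated from samples of q
    w = mu + sigma * rng.standard_normal(n)                        # reparameterised draws from q
    exp_ll = log_gauss(y[None, :], w[:, None] * x[None, :], 1.0).sum(axis=1).mean()
    kl = np.mean(log_gauss(w, mu, sigma) - log_gauss(w, 0.0, 1.0))  # MC estimate of KL(q || prior)
    return exp_ll - kl
```

A q centred near the data-fitting weight (here about w = 1) scores a higher ELBO than one centred far from it, which is exactly the "fit the data but stay close to the prior" trade-off described above.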
Hi, in bbp_homoscedastic.ipynb it seems that you chose a single Gaussian prior rather than a scale mixture prior. I think a mixture-of-Gaussians prior can better model the real distribution of the weights w. Thanks!