gdikov / bayesian-architecture-learning

Implementation of "Bayesian Learning of Neural Network Architectures"
http://proceedings.mlr.press/v89/dikov19a/dikov19a.pdf
MIT License

How is the KL-divergence term for the network weights KL(qη(W) || p(W)) calculated practically? #3

Open pingguokiller opened 1 year ago

pingguokiller commented 1 year ago

I've read your paper "Bayesian Learning of Neural Network Architectures", and thank you for sharing the code. I would like to build on your work.

I'm confused by the sentence "Also, given that the prior distribution p(W) is a Gaussian, the KL-divergence term for the network weights will be computed analytically and thus will reduce the variance in the gradient estimates." in the last paragraph of Section 2.1.

What does "be computed analytically" mean? I don't understand how the KL-divergence term for the network weights, KL(qη(W) || p(W)), is computed in practice. I also cannot find the corresponding code on GitHub; it seems to be omitted. Why is that?

Can you help me?

gdikov commented 1 year ago

Hi @pingguokiller,

What does "be computed analytically" mean?

"To be computed analytically" means that the integral has a closed-form solution, so we don't need to approximate it with MC sampling. Alternatively, one can draw samples from the approximate posterior and average the log-ratio between it and the prior over those samples. A quick search turns up this step-by-step guide on how to do both analytic and sampling-based KL-divergence estimation for Gaussians.
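To make the difference concrete, here is a minimal NumPy sketch of both estimators for a fully factorised Gaussian posterior q = N(mu, sigma²) and a standard-normal prior p = N(0, 1), as assumed in the paper. This is not code from the repository; all names (`mu`, `log_var`, `n_samples`) are illustrative.

```python
import numpy as np

def kl_analytic(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over all weights.

    Per dimension: KL = 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def kl_monte_carlo(mu, log_var, n_samples=1000, seed=0):
    """Sampling-based estimate: average of log q(w) - log p(w) under w ~ q.

    Higher variance than the closed form, but also applicable when
    no analytic KL exists for the chosen prior/posterior pair.
    """
    rng = np.random.default_rng(seed)
    sigma = np.exp(0.5 * log_var)
    # Draw samples via the reparameterisation w = mu + sigma * eps
    w = mu + sigma * rng.standard_normal((n_samples,) + mu.shape)
    log_q = -0.5 * (np.log(2 * np.pi) + log_var + ((w - mu) / sigma) ** 2)
    log_p = -0.5 * (np.log(2 * np.pi) + w ** 2)
    return np.sum(np.mean(log_q - log_p, axis=0))

mu = np.array([0.3, -1.2])
log_var = np.array([-0.5, 0.1])
print(kl_analytic(mu, log_var))     # exact value
print(kl_monte_carlo(mu, log_var))  # close to the exact value, but noisy
```

The closed form is preferred in the ELBO whenever it is available, precisely because it removes the sampling noise of the second estimator from the gradients.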

I also cannot find the corresponding code on GitHub; it seems to be omitted. Why is that?

Since this work was done during an internship at a company that didn't permit me to open-source it, I created these notebooks to show the gist of it. The code does not reproduce all experiments from the paper; rather, it demonstrates the mechanism of learning layer size and network depth in an MLP. Starting from this, extending it to convolutional layers or to networks with Bayesian weights should be straightforward.

Cheers, Georgi

pingguokiller commented 1 year ago

Thanks for your kind explanation. It helped me a lot.