aleximmer / Laplace

Laplace approximations for Deep Learning.
https://aleximmer.github.io/Laplace
MIT License

[theory] FunctionalLaplace - why define the GP on the prior instead of the posterior + what is the difference with FullLaplace? #249

Open smartArancina opened 1 week ago

smartArancina commented 1 week ago

Premise

Sorry if some questions (or all of them! :) ) seem stupid / basic / totally wrong. I am in the process of self-studying GPs, improving my Bayes knowledge, and learning about the Laplace approximation for NNs, and I am falling in love with these topics, but I certainly still have to understand them better!

Description

In particular I am referring to the original paper: Improving predictions of Bayesian neural nets via local linearization

Below I first lay out the reasoning flow as I understood it.

Weight space

From sections 3.1 - 3.2 of the paper I understood that using the GGN to approximate the weights posterior is justified when we use the linearized model in the likelihood, and this leads to a GLM,

where the resulting approximate (Laplace-GGN) posterior over the weights is the Gaussian sketched below.
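Since I cannot reproduce the paper's equation images here, what I mean is roughly the following (with $\theta_*$ the MAP estimate, $J_*(x)$ the network Jacobian at $\theta_*$, $\Lambda(y; f) = -\nabla^2_{ff} \log p(y \mid f)$ the per-input log-likelihood Hessian, and $S_0$ the prior covariance):

$$
f_{\mathrm{lin}}(x; \theta) = f(x; \theta_*) + J_*(x)\,(\theta - \theta_*),
$$

$$
q(\theta) = \mathcal{N}\big(\theta;\ \theta_*,\ \Sigma\big), \qquad \Sigma = \Big(\sum_{n=1}^{N} J_*(x_n)^\top \Lambda(y_n; f_n)\, J_*(x_n) + S_0^{-1}\Big)^{-1}.
$$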

Function space (GP)

we define the full GP posterior inference using the GGN + GLM as follows, starting from the prior in function space (sketched below).
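Roughly, what I mean is that the linearized model with a Gaussian weight prior $\theta \sim \mathcal{N}(m_0, S_0)$ induces a GP prior of the form

$$
f_{\mathrm{lin}} \sim \mathcal{GP}\big(m(x),\ \kappa(x, x')\big), \qquad m(x) = f(x; \theta_*) + J_*(x)\,(m_0 - \theta_*), \qquad \kappa(x, x') = J_*(x)\, S_0\, J_*(x')^\top,
$$

and the posterior inference is then carried out in function space against this prior.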

Questions

  1. Why do we use the prior in the GP formulation instead of the approximate posterior found with the GGN?
  2. What is the difference between FunctionalLaplace and FullLaplace (maths formulas aside)? That is: why would doing the posterior inference directly in function space, starting from the functional prior, lead to different results? I thought that the NN linearization + Gaussian likelihood resulted in the GP formulation sketched below (using the linearity property of Gaussians and the fact that they are closed under conditioning and marginalization, plus assuming a zero-mean, diagonal prior on the weights). The FullLaplace implementation should follow this approach, if I understood correctly.
  3. In the paper you showed approximating the weights posterior predictive distribution by sampling because you assume a general likelihood, right? Otherwise, in the case of a Gaussian likelihood, we should have the closed formula sketched below (the one referred to in question 2), right?
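To make questions 2 and 3 concrete, the closed-form predictive I mean is the usual GP posterior (roughly, with observation noise $\sigma^2$, kernel matrices $K_{XX}$, $K_{*X}$, $K_{**}$ built from the $\kappa$ above, and $m(\cdot)$ the GP prior mean):

$$
p(f_* \mid x_*, \mathcal{D}) = \mathcal{N}\Big(f_*;\ m(x_*) + K_{*X}\big(K_{XX} + \sigma^2 I\big)^{-1}\big(y - m(X)\big),\ \ K_{**} - K_{*X}\big(K_{XX} + \sigma^2 I\big)^{-1} K_{X*}\Big).
$$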

aleximmer commented 6 days ago

Hi, these are great questions. I will try to clarify:

  1. You can either convert the prior into a GP and then do inference in the function space to get the corresponding posterior or, as you suggest, convert the weight-space posterior into its equivalent GP posterior. The result is the same since after linearization you can freely move between both views.
  2. FunctionalLaplace and FullLaplace are equivalent, see for example this test https://github.com/aleximmer/Laplace/blob/main/tests/test_functional_laplace.py#L33. So they should not lead to different results unless you introduce further approximations to either the parametric or the functional posterior, such as subset-of-data.
  3. Yes, for a Gaussian likelihood you have a closed-form posterior predictive when linearizing. This is also implemented in the library (see the sketch below).
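For instance, a minimal sketch of checking the equivalence numerically could look like the following (the FunctionalLaplace constructor arguments here are an assumption and the predictive call signatures may differ slightly; the linked test shows the exact usage):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from laplace import FullLaplace, FunctionalLaplace

# Toy 1D regression data and a small MLP (illustrative only;
# `model` should be trained to a MAP estimate before fitting Laplace).
torch.manual_seed(0)
X = torch.linspace(-2, 2, 50).unsqueeze(-1)
y = torch.sin(3 * X) + 0.1 * torch.randn_like(X)
loader = DataLoader(TensorDataset(X, y), batch_size=50)
model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

# Parametric (weight-space) Laplace with the full GGN.
full_la = FullLaplace(model, likelihood="regression")
full_la.fit(loader)

# Functional (GP) Laplace on the *entire* dataset, i.e. no subset-of-data.
# NOTE: the `n_subset` argument is an assumption; check the docs/tests for the exact signature.
fun_la = FunctionalLaplace(model, likelihood="regression", n_subset=len(X))
fun_la.fit(loader)

# With matching prior precision and observation noise, the closed-form
# GLM and GP predictives should agree up to numerical error.
x_test = torch.linspace(-3, 3, 20).unsqueeze(-1)
f_mu_glm, f_var_glm = full_la(x_test, pred_type="glm")
f_mu_gp, f_var_gp = fun_la(x_test)
print(torch.allclose(f_mu_glm, f_mu_gp, atol=1e-4),
      torch.allclose(f_var_glm.squeeze(), f_var_gp.squeeze(), atol=1e-4))
```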

I hope this answers your questions, otherwise feel free to follow up.

smartArancina commented 1 day ago

Hi @aleximmer, thanks a lot for the answers, they helped a lot. Yes, on question 2 I was using subset-of-data (SoD) in my example, which produced the difference that confused me. So, happy to see that on the entire dataset FunctionalLaplace and FullLaplace should match. But now I have other questions! :)

  1. When and why should one use FunctionalLaplace over FullLaplace in the regression setting?
  2. Uncertainty interpretation: considering the exact log-posterior Hessian and the model linearization, we arrive at the posterior predictive distribution sketched below. Is it correct to say that the uncertainty represents how sensitive (w.r.t. the weights) the linearized model at the MAP point is at that input location, weighted by the loss Hessian (i.e. how constrained the weights are after training)?
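To make question 2 concrete, the predictive I have in mind is roughly (with $\Sigma$ the Laplace-GGN posterior covariance and $\sigma^2$ the observation noise):

$$
p(y_* \mid x_*, \mathcal{D}) \approx \mathcal{N}\big(y_*;\ f(x_*; \theta_*),\ J_*(x_*)\, \Sigma\, J_*(x_*)^\top + \sigma^2\big),
$$

so the epistemic part of the variance is large exactly when $J_*(x_*)$ points along weight-space directions that the training data left poorly constrained (large $\Sigma$), which is the interpretation I am asking about.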

Thanks again.