We discussed this in the meeting but forgot to add it as a task.
Show that when training an RGNP on draws from a GP with varying hyperparameters, it effectively learns a posterior distribution.
See Figure 4 and Section 5.1 of the neural diffusion process (NDP) paper for how to do it; we can replicate exactly the same setup or some variation thereof (note that the OpenReview version is more recent than the arXiv version).
Ideally we show that an RGNP does well here, while a standard GNP struggles (we should check whether an AttnGNP does okay).
I mentioned RGNP / GNP / AttnGNP because the standard (R)CNP does not generate correlated predictions, which are needed here to figure out e.g. the length scale of the learnt process. (Although we could use the autoregressive procedure of the AR-CNP paper.)
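For reference, the AR procedure only needs marginal predictions: targets are sampled one at a time and each sampled point is fed back in as context. A minimal sketch, assuming a hypothetical `model.predict(xc, yc, xt)` interface that returns per-target marginal means and variances (not any particular library's API):

```python
import numpy as np

def ar_sample(model, xc, yc, xt, rng=None):
    """Draw one correlated function sample at target inputs `xt` from a model
    that only outputs marginal Gaussians, by sampling autoregressively.

    Assumes a hypothetical interface model.predict(xc, yc, xt) -> (mean, var),
    where mean and var have shape (len(xt),).
    """
    rng = rng or np.random.default_rng()
    xc, yc = list(xc), list(yc)
    ys = []
    for x in xt:  # a random ordering of the targets also works
        mean, var = model.predict(np.array(xc), np.array(yc), np.array([x]))
        y = rng.normal(mean[0], np.sqrt(var[0]))
        xc.append(x)  # feed the sampled point back in as context
        yc.append(y)
        ys.append(y)
    return np.array(ys)
```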
It shouldn't be too hard to code up once we have an implementation of RGNP.
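On the data side, a minimal sketch of the kind of generator I have in mind (not the exact NDP setup): per task, sample a lengthscale from a prior, draw a function from the corresponding GP, and split the points into context and target sets. The ranges below are placeholders:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale, variance=1.0):
    """Squared-exponential kernel matrix."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_task(rng, n_points=64, n_context=16, x_range=(-2.0, 2.0),
                lengthscale_range=(0.1, 1.0)):
    """Draw one GP sample with a randomly chosen lengthscale and split it
    into context and target sets. All ranges are illustrative placeholders."""
    lengthscale = rng.uniform(*lengthscale_range)
    x = rng.uniform(*x_range, size=n_points)
    K = rbf_kernel(x, x, lengthscale) + 1e-6 * np.eye(n_points)
    y = rng.multivariate_normal(np.zeros(n_points), K)
    idx = rng.permutation(n_points)
    ctx, tgt = idx[:n_context], idx[n_context:]
    return (x[ctx], y[ctx]), (x[tgt], y[tgt]), lengthscale

rng = np.random.default_rng(0)
(xc, yc), (xt, yt), true_lengthscale = sample_task(rng)
```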
[x] Implement an experimental setup similar to NDP due:05-03
[x] Show that training RGNP with varying hyperparameters effectively learns a posterior due:05-07
[x] Compare to GNP and AttnGNP due:05-07
[ ] Finetune the setup and make pretty figures depending on the evaluation outcome due:05-10
Status
The approach works in general, but has not yet reproduced the very narrow histogram that Dutordoir et al. report.
Whether there is an improvement still depends on seeds and hyperparameters, but usually AttnGNP and RGNP improve upon GNP.
Opinion: in its current setup this is more of a toy example. It could become more interesting if we switch to a quantitative comparison of how easily each model marginalizes hyperparameters (final performance, computational cost, ...).
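On the evaluation side, one way to make the histogram (and, later, a more quantitative comparison) concrete: draw correlated samples from the trained model at dense target inputs, fit a GP to each sample by maximum marginal likelihood, and histogram the recovered lengthscales against the one used to generate the context. A rough sketch using scikit-learn for the per-sample fit; the `samples` array (n_samples x n_targets) is assumed to come from the model:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def recovered_lengthscales(xt, samples):
    """Fit an RBF-kernel GP to each correlated sample drawn from the model
    (samples: n_samples x n_targets) and return the ML lengthscales."""
    lengthscales = []
    for y in samples:
        gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-4)
        gpr.fit(xt.reshape(-1, 1), y)
        lengthscales.append(gpr.kernel_.length_scale)
    return np.array(lengthscales)

# Histogram these against the lengthscale used to generate the context set,
# e.g. plt.hist(recovered_lengthscales(xt, samples), bins=30).
```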