CliMA / EnsembleKalmanProcesses.jl

Implements Optimization and approximate uncertainty quantification algorithms, Ensemble Kalman Inversion, and Ensemble Kalman Processes.
https://clima.github.io/EnsembleKalmanProcesses.jl/dev
Apache License 2.0
80 stars 18 forks source link

EKP for general likelihoods? #262

Closed sdwfrost closed 1 year ago

sdwfrost commented 1 year ago

Is it straightforward to implement the algorithm described in https://www.sciencedirect.com/science/article/pii/S0167715222000967?

odunbar commented 1 year ago

I've had some discussion with @eviatarbach about this.

The key point is that, in the notation of the paper linked, they view a forward map H as deterministic.

In our code, we view the EKP forward map G can be nonlinear and/or non-deterministic (and non-Gaussian). For example you could view our forward map G in the paper notation as an evaluation theta -> p_D(z|H(theta)) where D is some unknown distribution and H as an unknown deterministic map.

If you do so, with perfect observation p(y|z)=z then you obtain exactly the Algorithm 1 in the paper. (NB We don't yet have the timestepping implementation for h)

The only non-default option you therefore require if you wanted to do something like this, is to be able to observe the non-deterministic map perfectly.

For this we have a keyword argument deterministic_forward_map::Bool = true in update_ensemble! (at least for EKI Inversion() update). If

  1. true, the form p(y|G(x)) is assumed Gaussian with given noise covariance obs_noise_cov
  2. false, the form p(y|G(x)) is assumed to be a delta function y = G(x)

Quick resolution

You should be able to run the code with the following alteration

Let us know how you get on!

Future actions

sdwfrost commented 1 year ago

Thanks @odunbar, this is super helpful. I have to deal with count (integer) data (eg number of new infections per day for an infectious disease), that I normally transform via log(x+1). In the standard framework, I get biased parameter estimates (populations are small, so my statistics are far from being Gaussian), though for an initial condition for another scheme, your EKP library is super useful.

odunbar commented 1 year ago

Thanks!

I should caveat that EKP still only has good theory for linear & Gaussian problems.

That is to say, although our parameter distribution setups and the algorithms do allow for more interesting priors and likelihoods by pushing nonlinearities/non-Gaussianities into the forward map, this can still yield biases, as EKP still only evolve a Gaussian approximation to perform optimization. In practice though it is has shown effective in many cases.

PS RE the count data, I am by no means an expert, but we do have some work in a filtering context, where we converted infection test-results into PPV/FOR data for data assimilation (c.f. "Synthetic Data" section). This is a different transformation that puts the data into [0,1], and might be of interest to you, though the boundaries can still prove challenging.

sdwfrost commented 1 year ago

Thanks for the feedback. Here is a writeup of my toy model using EKI and EKS. I'm pretty happy with how well it worked out - thanks!

odunbar commented 1 year ago

Awesome! Thanks for the write-up, I'm pleased it worked out for you, it's really great to hear.

If you ever work with a model you can't run >10^4 times for EKS, we do also have an ML-based pipeline CalibrateEmulateSample.jl (CES), currently just for moderate dimensional parameter O(10) , to accelerate obtaining approximate posterior distributions based on the output of EKI. In fact, for CES the posteriors do not need to be based on the Gaussianity that EKS requires, so they may also prove more flexible in capturing correlations.

[Please let me know if you are interested in trying this out too. It should hopefully just require a short script taking the priors and EKI ensemble as input]

odunbar commented 1 year ago

Closing as resolved.