1313e / PRISM

An alternative to MCMC for rapid analysis of models
https://prism-tool.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Some questions about the emulator #6

Closed JohannesBuchner closed 5 years ago

JohannesBuchner commented 5 years ago

Hi,

this is a very cool project. I am interested in using some parts of it, in particular the emulator. If I understand correctly, I can use the pipeline to construct and regress the emulator for a d-dimensional model with existing N model evaluations. My questions are:

1) How do I call the emulator myself to obtain the estimated model output and its uncertainty?
2) Would it be possible for the emulator to compute gradients? (because it is a Gaussian approximation)
3) If I later obtain more samples (through my own external sampling process), how can I inform/update the pipeline?

I was also wondering whether you know of any emulator packages that use Gaussian processes. Have you thought of implementing them, and what would the pros and cons be compared to the polynomial approach?

Cheers, Johannes

1313e commented 5 years ago

Hi @JohannesBuchner, thanks for your questions. I will try to answer them the best I can:

  1. You can use the evaluate() method of the Pipeline class to evaluate a sample (set) in the emulator. Providing a single sample will print the results, while providing a sample set will return the results in a dict. This gives you the adjusted expectation and variance values, the implausibility values, and the emulator iteration at which every sample was last evaluated. The evaluate() method performs a lot of checks, as it is a user method. For advanced use, it may therefore be better to use the _evaluate_sam_set() method, supplying the emul_i, your sam_set, and exec_code='evaluate'. This returns a tuple with all the results (adj_exp_val, adj_var_val, uni_impl_val, emul_i_stop, impl_check).
  2. I am not entirely sure I understand what you mean here. The emulator can handle gradient fields in models perfectly fine, as they can also be described with polynomial terms. But, I do need a bit more information here to understand what you mean.
  3. Updating the Pipeline with self-calculated samples cannot currently be done for any iteration after the first. The reason is that the Pipeline then cannot guarantee in any way that those samples should be used for evaluating the model and creating a new emulator iteration. The emulator is built such that every emulator iteration is defined over the plausible region of the previous iteration, so the samples used for this must be picked carefully to ensure that property holds. A constructed emulator can, however, be combined with a different sampling process by using the hybrid sampling functions given by the utils module.
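As a hedged sketch of the two evaluation paths described in point 1 (the helper name here is hypothetical; `pipe` is assumed to be a prism.Pipeline on which construct() has already been called):

```python
def evaluate_in_emulator(pipe, sam_set, advanced=False, emul_i=None):
    """Hypothetical helper wrapping the two evaluation paths of a
    constructed PRISM Pipeline, per the answer above."""
    if advanced:
        # Low-level path, skipping the user-method checks; returns the
        # tuple (adj_exp_val, adj_var_val, uni_impl_val, emul_i_stop,
        # impl_check).
        return pipe._evaluate_sam_set(emul_i, sam_set, exec_code='evaluate')
    # User path: a single sample prints the results, a sample set
    # returns a dict with the adjusted expectation/variance values and
    # the implausibility values.
    return pipe.evaluate(sam_set)
```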

Well, most interpolation methods are already Gaussian-based, so those would be quite similar to this. I do not personally know of any public packages that implement a system like PRISM does. In PRISM, one can turn off the regression process, leaving only the Gaussian processes (by setting method to 'gaussian'). The reason PRISM uses polynomial functions is that every model should have some underlying structure. Identifying this structure with polynomial functions provides the user with much more information than explaining the covariance entirely with Gaussian processes.
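As an illustrative, self-contained sketch (not PRISM's internals) of what "explaining the covariance entirely with Gaussian processes" means, here is a minimal noise-free GP interpolator in NumPy; the kernel choice and length scale are assumptions for the toy:

```python
import numpy as np

def gp_predict(X, y, X_new, length=1.0, jitter=1e-8):
    """Minimal Gaussian-process interpolation with an RBF kernel.

    Toy illustration only: the entire fit lives in the covariance,
    with no regression surface capturing global structure first.
    """
    def kernel(A, B):
        d2 = (A[:, None] - B[None, :]) ** 2
        return np.exp(-0.5 * d2 / length**2)

    K = kernel(X, X) + jitter * np.eye(len(X))  # training covariance
    K_s = kernel(X_new, X)                      # cross-covariance
    return K_s @ np.linalg.solve(K, y)

X = np.array([0.0, 0.5, 1.0])
y = X**2                    # toy "model output" at the training samples
pred = gp_predict(X, y, X)  # at the training points, the GP reproduces y
```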

I hope that answers your questions.

Cheers, Ellert

JohannesBuchner commented 5 years ago

Thank you for your answers!

Re 2): I was wondering if one could obtain the derivatives of the emulated model output w.r.t. the model parameters at an evaluated parameter set, df(x)/dx_i. Some exploration methods can benefit from gradients, and having approximate gradients could be helpful.

Maybe I am confused: I was thinking of a model that returns a single number at each position in parameter space (e.g., a loglikelihood function). Or is the model meant to be the prediction in data space?

1313e commented 5 years ago

Hi @JohannesBuchner,

Ah, now I understand what you mean. PRISM does not calculate the gradients (derivatives) of the emulated model output, as it does not require them. However, given that it provides the user with the polynomial terms and their corresponding coefficients, I guess it would not be very difficult to obtain the derivatives from those. For the derivatives to be accurate enough, however, it would be advisable to make sure the emulator has converged as far as it can, to avoid obtaining misleading information. I do realize now that some MCMC methods (like Hamiltonian Monte Carlo) require the gradient field of a model to exist, and for those it may be useful to have an approximate gradient field. I might actually think about that.
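To sketch the idea above: once the polynomial terms and coefficients of a regression surface are known, its gradient follows analytically. The term layout and names below are illustrative, not how PRISM stores them:

```python
import numpy as np

# Toy regression surface for one output:
#   f(x1, x2) = c0 + c1*x1 + c2*x2 + c3*x1*x2
def f(x, c):
    x1, x2 = x
    return c[0] + c[1]*x1 + c[2]*x2 + c[3]*x1*x2

def grad_f(x, c):
    """Analytic gradient (df/dx1, df/dx2), term by term."""
    x1, x2 = x
    return np.array([c[1] + c[3]*x2,   # df/dx1
                     c[2] + c[3]*x1])  # df/dx2

c = np.array([1.0, 2.0, -0.5, 3.0])
g = grad_f([0.4, 0.7], c)  # -> [2 + 3*0.7, -0.5 + 3*0.4] = [4.1, 0.7]
```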

One has to be very careful here with the definitions of 'model', 'comparison data' and 'emulator'. A 'model' is any black box wrapped by a ModelLink subclass that takes a parameter set and returns a list/array of data values corresponding to the requested data points (which are given by data identifiers). The 'comparison data' are the "real" values of these data points: the user wants to find out what part of parameter space can generate model realizations that produce values very close to these "real" values. An 'emulator' is an approximation of the model, made to replace the process of evaluating the model and thereby significantly speed up the convergence process.
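A standalone toy capturing these definitions (it only mirrors the idea of a ModelLink-wrapped black box; the real prism.modellink.ModelLink API may differ):

```python
# 'Model': a black box mapping a parameter set to data values at the
# requested data points (identified by data_idx).
class ToyModel:
    def __init__(self, data_idx):
        self.data_idx = data_idx          # data identifiers

    def call_model(self, par_set):
        a, b = par_set                    # the parameter set
        # one data value per requested data point
        return [a * x + b for x in self.data_idx]

model = ToyModel(data_idx=[1.0, 2.0, 3.0])
comparison_data = [3.0, 5.0, 7.0]         # "real" values to reproduce
out = model.call_model([2.0, 1.0])        # a model realization
```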

Therefore, the emulator gives an expectation (prediction) of the value the model would return if it were evaluated there. It approximates the model for the specified data points, becoming more and more accurate in the regions of parameter space where the probability is high that a model realization can explain the comparison data. In regions where that probability is low, the emulator's approximation will be very rough and inaccurate.

Does that answer your questions?