acerbilab / pyvbmc

PyVBMC: Variational Bayesian Monte Carlo algorithm for posterior and model inference in Python
https://acerbilab.github.io/pyvbmc/

`ValueError` raised in `FunctionLogger` when target density is zero #137

Closed by matt-graham 1 year ago

matt-graham commented 1 year ago

Raising this issue as part of the JOSS review https://github.com/openjournals/joss-reviews/issues/5428

The code block

https://github.com/acerbilab/pyvbmc/blob/b7c6e381be75dd508ac9663825fd77ed8885816d/pyvbmc/function_logger/function_logger.py#L148-L156

raises a `ValueError` when the value returned by the function is non-scalar, non-finite, or non-real. When the function corresponds to the logarithm of the target density, this excludes the target density taking the value zero, and so its logarithm being negative infinity. I imagine this might be a constraint implied by using a Gaussian process to fit the log target density evaluations, so the check probably makes sense. But as it implies a (weak) constraint on the class of posterior distributions that can be targeted, it may be worth explicitly mentioning in the documentation that this is disallowed, and/or suggesting workarounds (possibly outputting a large in magnitude but finite negative value as a proxy for negative infinity?). Some common cases where users might want to use a target density which may evaluate to exactly zero are:

(i) the prior has support only on a subset of the specified domain, but this doesn't correspond to a simple Cartesian product of intervals as covered by the lower / upper bound options;
(ii) observation noise from some truncated distribution is assumed, which means the likelihood may evaluate to zero;
(iii) the model involves a numerical solver which may fail to converge for some choices of the parameters, for which a common resolution is to output a target density of zero to exclude those regions of parameter space from the posterior.

I hit against the latter case when trying out PyVBMC with a model which involves solving a system of ordinary differential equations for which the solver fails to converge for some parameter settings.
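For concreteness, here's a toy version of that last case (the model and "observation" are made up for illustration, but the `-np.inf` return is exactly the kind of value the check above rejects):

```python
import numpy as np
from scipy.integrate import solve_ivp

def log_density(theta):
    # Toy ODE model: dy/dt = -theta[0] * y, with y(0) = 1.
    sol = solve_ivp(lambda t, y: -theta[0] * y, (0.0, 1.0), [1.0])
    if not sol.success:
        # Solver failed: assign zero posterior density to this region,
        # i.e. a log-density of negative infinity.
        return -np.inf
    # Made-up Gaussian log-likelihood comparing y(1) to an "observation".
    residual = sol.y[0, -1] - 0.5
    return -0.5 * residual**2
```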

As a sidenote, I might be wrong, but I think the `np.any` call in the condition expression of the `if` block is not needed. `np.any` would usually be used on an array of boolean values. Here, if `f_val_orig` is not a scalar, then `not np.isscalar(f_val_orig)` evaluates to `True`, and because of the short-circuit evaluation semantics of `or`, the whole expression evaluates to `True` without the `not np.isfinite(f_val_orig)` and `not np.isreal(f_val_orig)` parts ever being evaluated. If instead `f_val_orig` is a scalar, then both `np.isfinite(f_val_orig)` and `np.isreal(f_val_orig)` output scalars, I believe, so the overall expression always evaluates to either (scalar) `True` or `False` without the need for `np.any`.
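That is, the check could be written along the following lines (a sketch of the logic, not the exact code in `function_logger.py`):

```python
import numpy as np

def check_f_val(f_val_orig):
    # If f_val_orig is not a scalar, the first clause is True and `or`
    # short-circuits; otherwise all three clauses yield scalar booleans,
    # so the whole expression is already a scalar and np.any is redundant.
    if (
        not np.isscalar(f_val_orig)
        or not np.isfinite(f_val_orig)
        or not np.isreal(f_val_orig)
    ):
        raise ValueError(
            "The returned function value must be a finite real-valued scalar."
        )
```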

matt-graham commented 1 year ago

On some further exploration of the documentation, I found this relevant note in the VBMC FAQ, which also suggests that my proposed approach of outputting a finite negative value as a proxy for negative infinity is not a good one.

lacerbi commented 1 year ago

Thanks for the comment! This is actually a non-trivial issue to deal with when working with surrogate models of the log-density. The original VBMC algorithm does not deal with `Inf`s and `NaN`s, assuming the output of the log-density is always well-behaved. This is indeed a limitation; for now, we should probably mention it more explicitly in the documentation (besides the FAQ), as you also suggest above:

it may be worth explicitly mentioning this is disallowed in the documentation

As for future plans, we do want to address the issue of zero densities (or `NaN`s).

A naive workaround of outputting an arbitrary large negative number would break the GP surrogate, but a heuristic dynamic approach that sets this value based on the currently observed values might work, combined with (fake) observation noise that informs (Py)VBMC not to take that observation too seriously (for the latter, see this preprint, Section 3.5 on noise shaping).
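As a very rough sketch of what such a heuristic might look like (illustrative only; this is not current PyVBMC behavior or API):

```python
import numpy as np

def floored_observation(f_val, observed_vals, width=20.0, noise_sd=10.0):
    """Replace a -inf log-density with a dynamic floor plus fake noise."""
    if np.isfinite(f_val):
        return f_val, 0.0  # ordinary, (near-)noiseless observation
    # The floor tracks the best log-density seen so far, so it stays on a
    # sensible scale instead of being an arbitrary huge negative number.
    floor = np.max(observed_vals) - width
    # Large observation noise tells the surrogate not to take the
    # substituted value too seriously (cf. noise shaping).
    return floor, noise_sd
```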

In an ideal world, we would use a logistic non-Gaussian likelihood for the GP model that informs the GP surrogate that the observation is below a threshold. However, introducing non-Gaussian likelihoods requires approximate inference for the GP, which is a major modification of the algorithm and a separate research project in itself.

matt-graham commented 1 year ago

Thanks @lacerbi for the detailed explanation. I think having something slightly more prominent in the documentation would be good, though I did end up finding the relevant FAQ entry, so it's not essential.

Out of interest, in terms of the possible longer-term fixes, is the latter approach you suggest, of informing the GP surrogate that the target is below a threshold, in a similar vein to the one the authors of the GPry package / paper employ for this issue: simultaneously fitting a classifier (they use a support vector machine) which partitions the latent space into regions with well-defined target (log-density) values and regions with non-finite / undefined values? That seemed an interesting idea, but I haven't had a chance to try out GPry yet to see how well it works in practice.
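Roughly, as I understand it (my own illustration of the general idea, not GPry's actual API):

```python
import numpy as np
from sklearn.svm import SVC

# Classify which evaluated points returned a finite log-density, then use
# the classifier to mask out regions where evaluations are expected to
# fail before querying the GP surrogate there.
X = np.random.randn(200, 2)               # evaluated parameter points
finite = np.linalg.norm(X, axis=1) < 2.0  # stand-in for "target was finite"
classifier = SVC(kernel="rbf").fit(X, finite)

x_new = np.array([[0.1, -0.3]])
if classifier.predict(x_new)[0]:
    pass  # predicted feasible: safe to evaluate / include in the GP
```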

lacerbi commented 1 year ago

Thanks @matt-graham - I am aware of the GPry approach, but we have not tried it out - it is an interesting idea that might work in practice, and we were planning to give it a try as well.

Still, it remains an ad hoc heuristic, since it doesn't really include the measurement in the GP surrogate; the SVM and the GP model are entirely distinct. For example, I am not sure how it avoids ruling out whole areas of parameter space just because one or two computations failed; or how it prevents the acquisition function from repeatedly trying to go to the border of the infeasible region (since it was a good region to explore before, it likely remains a good region to explore now, as the GP has not changed), unless e.g. some additional heuristic "repulsion" term is added to the acquisition function (although the noise introduced by the batch acquisition might help).

The approach I was talking about would be more principled (but, of course, more costly), in that you do include the observation in the GP model, but instead of making it a standard observation with a Gaussian likelihood, you make it a "censored" observation that just informs the model that "the observed value is somewhere below this threshold", where the threshold could be some low density value (but not insanely low). The point is that we just want the log-density to be very low; we don't care about its exact value. However, a censored-observation model is non-Gaussian, so it requires approximate inference for the GP, which is a pain.
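Concretely, a standard censored Gaussian observation model (a sketch of the general idea, with noise scale $\sigma$ and threshold $c$; not necessarily the exact formulation in the paper linked below) gives the likelihood of observing "value below $c$" at input $x$ as

$$
p(y \le c \mid f(x)) = \Phi\left(\frac{c - f(x)}{\sigma}\right),
$$

where $\Phi$ is the standard normal CDF. This likelihood is non-Gaussian in $f(x)$, which is why exact GP conjugacy is lost and approximate inference is needed.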

I have been thinking about this for a very long time (since the original VBMC paper, more or less), but never got to it, mostly because implementing it in MATLAB was out of the question (no autodiff). Incidentally, a recent paper worked out a nice way to implement censored observations in GP models, exactly along the lines we have been thinking about (not for VBMC, of course; they present it for general GP regression): https://link.springer.com/article/10.1007/s11222-023-10225-3

PS: For now, we got to the stage of "unlocking" sparse variational GPs for VBMC in the preprint I linked above; this is a stepping stone towards more (once you can do approximate inference effectively, you open up to more complex surrogate GP models).

Bobby-Huggins commented 1 year ago

Regarding the extraneous `np.any` call, you are correct @matt-graham: it has been removed in pull request #140.

matt-graham commented 1 year ago

Thanks @Bobby-Huggins, and thanks for the additions to the documentation making the requirements for the target log-density function more visible. Closing this as now resolved from my perspective.