Closed matt-graham closed 1 year ago
On some further exploration of the documentation I found this relevant note in the VBMC FAQ, which also suggests my proposed approach of outputting a finite negative value as a proxy for negative infinity is not a good one.
Thanks for the comment! This is actually a non-trivial issue to deal with when working with surrogate models of the log-density.
The original VBMC algorithm does not deal with `Inf`s and `NaN`s, assuming the output of the log-density is always well-behaved. This is indeed a limitation; for now we should probably mention it more explicitly in the documentation (besides the FAQ), as you also suggest above:
> it may be worth explicitly mentioning this is disallowed in the documentation
As for future plans, we do want to address the issue of zero-density (or `NaN`) observations.
A naive workaround of outputting an arbitrarily large negative number will break the GP surrogate, but a heuristic dynamic approach that sets this value based on the currently observed values might work, combined with (fake) observation noise that informs (Py)VBMC not to take that observation too seriously (for the latter, see this preprint, Section 3.5 on noise shaping).
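This heuristic is not implemented in PyVBMC; as a rough sketch of the idea (the wrapper, its name, and the margin/noise constants are all hypothetical, not part of the PyVBMC API):

```python
import numpy as np

def make_safe_log_density(log_density, margin=20.0, fake_noise_sd=5.0):
    """Hypothetical wrapper sketching the dynamic-floor heuristic:
    non-finite log-density values are replaced by a floor set relative
    to the best finite value observed so far, and reported together
    with a large (fake) noise estimate so that a noise-aware surrogate
    would not take the observation too seriously."""
    best = -np.inf  # best finite log-density observed so far

    def safe_log_density(theta):
        nonlocal best
        val = log_density(theta)
        if np.isfinite(val):
            best = max(best, val)
            return val, 0.0  # exact observation, no extra noise
        # Dynamic floor: tied to the current observations rather than an
        # arbitrary constant (the margin would need tuning in practice).
        floor = best - margin if np.isfinite(best) else -margin
        return floor, fake_noise_sd

    return safe_log_density
```

Here the wrapped target returns a `(value, noise)` pair; exactly how such noise estimates would be fed to the surrogate is left open.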
In an ideal world, we should use a logistic non-Gaussian likelihood for the GP model that informs the GP surrogate that the observation is below a threshold. However, introduction of non-Gaussian likelihoods requires approximate inference for the GP, which is a major modification of the algorithm and a separate research project in itself.
Thanks @lacerbi for the detailed explanation. Having something slightly more prominent in the documentation would, I think, be good, though I did end up finding the relevant FAQ, so it is not essential.
Out of interest, regarding the possible longer-term fixes: is the latter approach you suggest, of informing the GP surrogate that the target is below a threshold, in a similar vein to the one the authors of the GPry package / paper employ for this issue? They simultaneously fit a classifier (a support vector machine) that partitions the latent space into regions with well-defined target (log density) values and regions with non-finite / undefined values. That seemed an interesting idea, but I haven't had a chance to try out GPry yet to see how well it works in practice.
Thanks @matt-graham - I am aware of the GPry approach but we have not tried it out; it is an interesting idea that might work in practice, and I was planning to give it a try as well.
Still, it remains an ad hoc heuristic, since it doesn't really include the measurement in the GP surrogate; the SVM and the GP model are entirely distinct. For example, I am not sure how it avoids ruling out whole areas of parameter space just because one or two computations failed; or how it stops the acquisition function from repeatedly probing the border of the unfeasible region (since it was a good region to explore before, it likely remains a good region to explore now, as the GP has not changed), unless e.g. some additional heuristic "repulsion" term is added to the acquisition function (although the noise introduced by the batch acquisition might help).
The approach I was talking about would be more principled (but, of course, more costly), in that you do include the observation in the GP model; instead of making it a standard observation with a Gaussian likelihood, you make it a "censored" observation that just informs the model that "the observed value is somewhere below this threshold", where the threshold could be some low density value (but not insanely low). The point is that we just want the log-density to be very low; we don't care about its exact value. However, a censored-observation model is non-Gaussian, so it requires approximate inference for the GP, which is a pain.
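To make the censored-observation idea concrete: the (non-Gaussian) likelihood of such an observation says only that the latent function value lies below a threshold, so its log-likelihood is the log of a Gaussian CDF. A minimal illustration (function name and signature are mine, not from any GP library):

```python
import math

def censored_obs_loglik(f_latent, threshold, noise_sd=1.0):
    """Log-likelihood of a censored observation that only tells the
    model "the observed value is somewhere below `threshold`":
        log P(y < threshold | f) = log Phi((threshold - f) / noise_sd),
    where Phi is the standard normal CDF. This term is non-Gaussian in
    f, which is why the GP then requires approximate inference."""
    z = (threshold - f_latent) / noise_sd
    # Standard normal CDF via the error function.
    return math.log(0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
```

When the latent value sits far below the threshold the term is near zero (no penalty); when it sits above, the term becomes strongly negative and pushes the surrogate down, which is exactly the "below this threshold" information we want to encode without caring about the exact value.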
I have been thinking about this for a very long time (since the original VBMC paper, more or less), but never got to it, mostly because implementing it in MATLAB was out of the question (no autodiff). Incidentally, a recent paper worked out a nice way to implement censored observations in GP models, exactly along the lines we have been thinking about (not for VBMC, of course; they present it for general GP regression): https://link.springer.com/article/10.1007/s11222-023-10225-3
PS: For now, we got to the stage of "unlocking" sparse variational GPs for VBMC in the preprint I linked above; this is a stepping stone towards further extensions, since once you can do approximate inference effectively, more complex surrogate GP models become accessible.
Regarding the extraneous `np.any` call, you are correct @matt-graham: it has been removed in pull #140.
Thanks @Bobby-Huggins, and thanks for the additions to the documentation making the requirements on the target log-density function more visible. Closing this as now resolved from my perspective.
Raising issue as part of JOSS review https://github.com/openjournals/joss-reviews/issues/5428
The code block
https://github.com/acerbilab/pyvbmc/blob/b7c6e381be75dd508ac9663825fd77ed8885816d/pyvbmc/function_logger/function_logger.py#L148-L156
raises a `ValueError` when the value returned by the function is non-scalar, non-finite, or non-real. When the function corresponds to the logarithm of the target density, this excludes the target density taking the value zero, and hence its logarithm being negative infinity. I imagine this might be a constraint implied by fitting a Gaussian process to the log target density evaluations, so the check probably makes sense, but since it implies a (weak) constraint on the class of posterior distributions that can be targeted, it may be worth explicitly mentioning in the documentation that this is disallowed, and/or suggesting workarounds (possibly outputting a large-in-magnitude but finite negative value as a proxy for negative infinity?). Some common cases where users might want a target density which can evaluate to exactly zero are:

(i) the prior has support only on a subset of the specified domain, and this subset does not correspond to a simple Cartesian product of intervals as covered by the lower / upper bound options;
(ii) observation noise from some truncated distribution is assumed, which means the likelihood may evaluate to zero;
(iii) the model involves a numerical solver which may fail to converge for some choices of the parameters, for which a common resolution is to output a target density of zero so as to exclude those regions of parameter space from the posterior.

I ran into the latter case when trying out PyVBMC with a model that involves solving a system of ordinary differential equations, where the solver fails to converge for some parameter settings.

As a sidenote, I might be wrong, but I think the `np.any` call in the condition expression of the `if` block is not needed. `np.any` would usually be used on an array of boolean values. Here, if `f_val_orig` is not a scalar, then `not np.isscalar(f_val_orig)` evaluates to `True`, and because of the short-circuit evaluation semantics of `or`, the whole expression evaluates to `True` without evaluating the `not np.isfinite(f_val_orig)` and `not np.isreal(f_val_orig)` parts of the expression. If instead `f_val_orig` is a scalar, then both `np.isfinite(f_val_orig)` and `np.isreal(f_val_orig)` output scalars (I believe), and the overall expression will therefore always evaluate to either a scalar `True` or `False`, without the need for `np.any`.
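To illustrate the point, here is the condition (reconstructed from the description above, with the `np.any` wrapper dropped) together with the cases it handles; each clause that actually runs already yields a plain boolean:

```python
import numpy as np

def rejects(f_val_orig):
    # Reconstruction of the function_logger condition, minus np.any.
    # Short-circuiting `or` means the later clauses only run on scalars.
    return (
        not np.isscalar(f_val_orig)
        or not np.isfinite(f_val_orig)
        or not np.isreal(f_val_orig)
    )

print(rejects(np.array([1.0, 2.0])))  # True: non-scalar, short-circuits
print(rejects(np.inf))                # True: scalar but non-finite
print(rejects(1.0 + 2.0j))            # True: scalar, finite, non-real
print(rejects(0.5))                   # False: a valid log-density value
```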