Balandat opened 5 years ago
If we want to make this usable with external optimizers without having to manually write the full optimization loop, we need to allow bounds to be registered with parameters, so that they can be retrieved programmatically from the model. This could be done in a way similar to how priors are handled. The simple version of this would just be lower and upper bounds on the parameter values, although we could think of more general constraints down the road.
Yeah, in practice I think there is never a case that the noise needs to be less than 1e-5 or 1e-6. As a short term solution, we could make it so that the preconditioner clamps down the added noise value to be at most 1e-6? @jacobrgardner
My preferred way to handle this would be to start using `add_jitter` in the regression setting (but maybe on the order of 1e-6). By still allowing the actual log noise parameter to be unconstrained, this prevents the total noise from going under 1e-6, but also avoids any boundary problems with the gradient like we saw when we had parameter bounds a while back.
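The mechanism can be illustrated without gpytorch: if the total noise is computed as `jitter + softplus(raw_noise)`, the raw parameter stays unconstrained while the total is bounded below. A minimal stdlib-only sketch (the `MIN_JITTER` constant and helper names are illustrative, not gpytorch API):

```python
import math

def softplus(x):
    # numerically stable softplus: log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

MIN_JITTER = 1e-6  # fixed jitter added on top of the learned noise (assumed value)

def total_noise(raw_noise):
    # raw_noise is unconstrained; softplus maps it to (0, inf), and the
    # fixed jitter keeps the total strictly above MIN_JITTER even as
    # raw_noise -> -inf
    return MIN_JITTER + softplus(raw_noise)

print(total_noise(-100.0))  # ~1e-6: bounded below despite an extreme raw value
```

No gradient ever sees a hard boundary here, since the raw parameter can still take any real value.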
Edit: It’s also worth noting that concerns like this are, in theory, exactly the reason we have priors. The “right” way to handle the problem is arguably to place a `SmoothBoxPrior` with an appropriate lower bound.
This is a fair point. Perhaps we want to have a `SmoothBoxPrior` by default on `GaussianLikelihood`? Then the user has to put in some effort to make things unstable.
So just using the prior can still lead to some finicky behavior when using optimizers that potentially take long steps (backtracking during line search doesn't help much, b/c if it evaluates at an initially large step size the damage is already done, as gpytorch will error out with some `NaN` issue).
By adding a minimal jitter, we effectively achieve a lower bound. The only somewhat tricky thing to be aware of is the scaling of this bound: if the training data is standardized, a single reasonable value works for everything; if not, you might end up drowning out any signal you have.
The limitation of using the jitter is that it addresses this particular issue only. More generally, one might put other constraints on model hyperparameters, and it would be nice to be able to use those explicitly in the optimization. I typically use a wrapper of scipy's L-BFGS-B for fitting models, which explicitly supports bounds on the parameters. Ideally the bounds could be stored with the model rather than having to be specified externally.
What do you think about the following change to `Module`:

```python
self._bounds = OrderedDict()
```

Then you could do something like

```python
self.register_bounds("raw_noise", lower=self._inv_param_transform(torch.tensor(1e-4)).item())
```

in `GaussianLikelihood`. This wouldn't change anything in terms of the current functionality. It would, however, allow you to read out the bounds and pass them appropriately to an external optimizer. I have a draft implementation of this that seems to work well.
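A minimal, gpytorch-free sketch of what such a registry could look like (`register_bounds` and `named_bounds` are hypothetical names following the proposal above, not existing gpytorch API):

```python
from collections import OrderedDict

class Module:
    """Minimal stand-in for gpytorch.Module, for illustration only."""
    def __init__(self):
        self._bounds = OrderedDict()
        self._modules = OrderedDict()

    def register_bounds(self, name, lower=None, upper=None):
        # store (lower, upper); None means unbounded on that side
        self._bounds[name] = (lower, upper)

    def named_bounds(self, prefix=""):
        # walk submodules so dotted names like "likelihood.raw_noise" come out,
        # mirroring how named_parameters() builds full names
        for name, (lo, hi) in self._bounds.items():
            yield (name if not prefix else prefix + "." + name), lo, hi
        for mod_name, mod in self._modules.items():
            sub_prefix = mod_name if not prefix else prefix + "." + mod_name
            yield from mod.named_bounds(sub_prefix)

# usage: a likelihood registers a lower bound on its raw noise parameter
likelihood = Module()
likelihood.register_bounds("raw_noise", lower=-9.2)  # e.g. an inverse-transformed 1e-4
model = Module()
model._modules["likelihood"] = likelihood  # stand-in for submodule registration
print(list(model.named_bounds()))  # [('likelihood.raw_noise', -9.2, None)]
```

The dotted names can then be matched directly against `named_parameters()` output when handing bounds to an external optimizer.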
The one concern I have with the proposed interface is that, if I am reading this right, it essentially has no effect on anything unless the user explicitly leaves torch for optimization. This is problematic if the user expects the bounds to be enforced automatically.
The other concern I have for this being a general addition is GPU support, since I assume it is impossible to interface with scipy without adding at least some CPU <> GPU communication.
> it essentially has no effect on anything unless the user explicitly leaves torch for optimization
Not quite, it's more like it doesn't do anything unless the user explicitly makes use of it. That could be via scipy or other optimizers, but it could be via `torch.optim` as well. E.g. the user could use these bounds to programmatically clamp parameters within their torch optimization loop without having to specify the parameter names explicitly in the loop. We could make sure the docstring is explicit that this is not automatically used, and maybe point to an example of how it could be.
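The clamping idea can be sketched without torch: after each optimizer step, project any bounded parameter back into its interval, with no parameter names hard-coded in the loop (the bounds mapping and value dict below are illustrative stand-ins for the registered bounds and the model's parameters):

```python
def clamp(value, lower=None, upper=None):
    # project a value back into [lower, upper]; None means unbounded
    if lower is not None and value < lower:
        value = lower
    if upper is not None and value > upper:
        value = upper
    return value

# hypothetical bounds registry: dotted parameter name -> (lower, upper)
bounds = {"likelihood.raw_noise": (-9.2, None)}

# stand-in for the model's parameter values after an aggressive optimizer step
params = {"likelihood.raw_noise": -50.0, "covar_module.raw_lengthscale": 0.3}

# generic projection step: the loop never mentions specific parameter names
for name, value in params.items():
    if name in bounds:
        lo, hi = bounds[name]
        params[name] = clamp(value, lo, hi)

print(params["likelihood.raw_noise"])  # -9.2: projected back onto the lower bound
```

With torch, the same loop would wrap the in-place update in `torch.no_grad()` and use `tensor.clamp_`, but the bookkeeping is identical.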
> The other concern I have for this being a general addition is GPU support, since I assume it is impossible to interface with scipy without adding at least some CPU <> GPU communication.
This will be an issue only if you have to shove a bunch of data back and forth, i.e. if your model parameters are high-dimensional. If we're talking maybe a few dozen parameters in the case of an exact GP, this is pretty negligible compared to computing the posterior for a set of test points. Regarding the implementation: I have some optimizer utilities that live outside gpytorch that (i) automatically transform a model to a `scipy.optimize.minimize`-compatible description and (ii) update the model parameters during the optimization steps, making sure to retain the proper dtype and device.
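The transform in (i) essentially boils down to flattening named parameters into a single vector and inverting that mapping after each scipy step. A stdlib-only sketch (in practice the values would be tensors detached to NumPy on the way in, and written back with the original dtype and device on the way out):

```python
def flatten_params(params):
    # params: dict of name -> list of floats (stand-in for tensors)
    # returns a flat vector plus the bookkeeping needed to invert it
    names, sizes, flat = [], [], []
    for name, values in params.items():
        names.append(name)
        sizes.append(len(values))
        flat.extend(values)
    return flat, (names, sizes)

def unflatten_params(flat, spec):
    # inverse of flatten_params: slice the flat vector back out by name
    names, sizes = spec
    params, offset = {}, 0
    for name, size in zip(names, sizes):
        params[name] = flat[offset:offset + size]
        offset += size
    return params

params = {"raw_noise": [-9.2], "raw_lengthscale": [0.1, 0.2]}
flat, spec = flatten_params(params)
print(flat)  # [-9.2, 0.1, 0.2]
assert unflatten_params(flat, spec) == params
```

The registered bounds would be flattened with the same `spec`, so they line up index-for-index with the vector scipy's L-BFGS-B sees.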
Maybe the solution is to have `register_bounds` live somewhere outside `gpytorch.Module`? I'm not sure exactly what the right solution is, but I also would like to avoid bloating `gpytorch.Module` with `register_*` methods if possible.
Alternatively, is there some way that we can metaprogram a `register_*` thing onto modules, which could clean up all the things that we register (extra loss terms, priors, etc.)?
All in all, I wouldn't be opposed to this, if we do as @Balandat suggests and include an example w/ clamping. However, I do think that this might be a feature that belongs in PyTorch proper since it's not necessarily GP specific.
Yeah, ultimately we'll want some way of specifying constraints in pytorch that works easily with `torch.optim`, but that may be a little ways away if there are no algorithms that support it. It would be straightforward to modify most of the existing ones to be projected versions; maybe it's worthwhile writing that PR upstream.
For the time being, what one could do is have a simple `Bounds` structure outside of the module that specifies bounds (or even more general constraints). The simplest case would just be some dictionary mapping parameter names to bounds. Then this could be used appropriately in the optimization loop. The downside of this is that it requires a bunch of manual bookkeeping of the parameter names (with all the submodule dots...). I guess you could do something like write a function `add_bounds(bounds, module, parameter_name, lower=None, upper=None)` that takes in some external `Bounds` object `bounds` and the current module `module` (to automatically construct the full parameter name). We can have this live totally outside of gpytorch.
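A rough sketch of what that could look like, entirely outside gpytorch. Here bounds are recorded against the module object and the dotted names are resolved later against the root model's submodule tree; all names (`Bounds`, `add_bounds`, `resolve_names`) are hypothetical, and the `submodules` dict stands in for something like `named_modules()`:

```python
class Bounds:
    """Hypothetical external bounds registry; nothing here exists in gpytorch."""
    def __init__(self):
        self._by_module = {}  # id(module) -> {param_name: (lower, upper)}

def add_bounds(bounds, module, parameter_name, lower=None, upper=None):
    # record the bound against the owning module; the full dotted
    # parameter name is constructed later, once the root model is known
    bounds._by_module.setdefault(id(module), {})[parameter_name] = (lower, upper)

def resolve_names(bounds, submodules):
    # submodules: dict of dotted prefix -> module (stand-in for named_modules())
    resolved = {}
    for prefix, mod in submodules.items():
        for pname, bound in bounds._by_module.get(id(mod), {}).items():
            full = pname if not prefix else prefix + "." + pname
            resolved[full] = bound
    return resolved

class Dummy:  # stand-in for a gpytorch module
    pass

model, likelihood = Dummy(), Dummy()
b = Bounds()
add_bounds(b, likelihood, "raw_noise", lower=-9.2)
print(resolve_names(b, {"": model, "likelihood": likelihood}))
# {'likelihood.raw_noise': (-9.2, None)}
```

This keeps `gpytorch.Module` untouched, at the cost of the resolution step needing access to the root model.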
cc @sdaulton, @bletham, @bkarrer
Expanding on Geoff's idea to metaprogram all `register_*` and corresponding `named_*` methods, what if `gpytorch.Module` just implemented `register_parameter_metadata(parameter_name, meta_name, **kwargs)`? Then a call would look like:

```python
self.likelihood.register_parameter_metadata("raw_noise", "prior", ...)
```

`named_parameter_metadata` would require a `meta_name` as an argument instead of taking no args like e.g. `named_priors` does now. This is easily done without breaking anything, since the existing methods can just call the new ones, and we may not even want to deprecate the prior-specific ones since they are much more commonly used.
Edit: Come to think of it, it would be great if we could then just metaprogram `Module` so that `register_*` is an alias for this base method, so that any metadata could have the same interface as registering priors.
Hmm, yeah, that would work. The thing I am a little worried about is that the function signature for this would be really opaque, with a lot of magic happening, making things potentially quite hard to debug.
We want to have models behave reasonably when fit to "sufficiently nice" data. Right now, if you fit an exact GP with a `GaussianLikelihood` to such target data (e.g. a constant or a linear function), the optimization will error out with the following insightful error message:

This is because the most likely value for the `noise` parameter is zero (or close to it), which in both the `exp` and `softplus` parameterizations means the raw parameter approaches `-inf`, so the optimization runs into `NaN` values sooner or later.

Not quite sure what the best way to deal with this would be. Ideally we'd enforce a minimal noise level directly in the optimizer (if it supports constraints). For the pytorch optimizers, clamping would be an option.
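The divergence is easy to see concretely: the raw (unconstrained) parameter is the inverse transform of the noise, and it blows up as the noise goes to zero (stdlib-only illustration of the two parameterizations mentioned above):

```python
import math

def inv_exp(noise):
    # exp parameterization: noise = exp(raw), so raw = log(noise)
    return math.log(noise)

def inv_softplus(noise):
    # softplus parameterization: noise = log(1 + exp(raw)),
    # so raw = log(exp(noise) - 1)
    return math.log(math.expm1(noise))

for noise in (1e-2, 1e-4, 1e-8):
    print(noise, inv_exp(noise), inv_softplus(noise))
# under both transforms, the raw value heads to -inf as noise -> 0,
# and gradient steps in that regime eventually produce NaNs
```

A fixed jitter floor or an explicit lower bound on `raw_noise` keeps the optimizer out of this regime entirely.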