Balandat opened 5 years ago
If we want to make this usable with external optimizers without having to manually write the full optimization loop, we need to allow bounds to be registered with parameters, so that they can be retrieved programmatically from the model. This could be done in a way similar to how priors are handled. The simple version of this would just be lower and upper bounds on the parameter values, although we could think of more general constraints down the road.
Yeah, in practice I think there is never a case that the noise needs to be less than 1e-5 or 1e-6. As a short term solution, we could make it so that the preconditioner clamps down the added noise value to be at most 1e-6? @jacobrgardner
My preferred way to handle this would be to start using `add_jitter` in the regression setting (but maybe on the order of 1e-6). By still allowing the actual log noise parameter to be unconstrained, this prevents the total noise from going under 1e-6, but also avoids any boundary problems with the gradient like we saw when we had parameter bounds a while back.
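The mechanism can be illustrated without gpytorch: if the total noise is computed as `jitter + softplus(raw_noise)`, the raw parameter stays unconstrained while the total is bounded below. A minimal stdlib-only sketch (the `MIN_JITTER` constant and helper names are illustrative, not gpytorch API):

```python
import math

def softplus(x):
    # numerically stable softplus: log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

MIN_JITTER = 1e-6  # fixed jitter added on top of the learned noise (assumed value)

def total_noise(raw_noise):
    # raw_noise is unconstrained; softplus maps it to (0, inf), and the
    # fixed jitter keeps the total strictly above MIN_JITTER even as
    # raw_noise -> -inf
    return MIN_JITTER + softplus(raw_noise)

print(total_noise(-100.0))  # ~1e-6: bounded below despite an extreme raw value
```

No gradient ever sees a hard boundary here, since the raw parameter can still take any real value.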
Edit: It’s also worth noting that concerns like this are, in theory, exactly the reason we have priors. The “right” way to handle the problem is arguably to place a `SmoothBoxPrior` with an appropriate lower bound.
This is a fair point. Perhaps we want to have a `SmoothBoxPrior` by default on `GaussianLikelihood`? Then the user has to put in some effort to make things unstable.
So just using the prior can still lead to some finicky behavior when using optimizers that potentially take long steps (backtracking during line search doesn't help much, b/c if it evaluates at an initially large step size the damage is already done, as gpytorch will error out with some `NaN` issue).
By adding a minimal jitter, we effectively achieve a lower bound. The only somewhat tricky thing to be aware of is the scaling of this bound: if the training data is standardized, a single reasonable value works for everything; if not, you might end up drowning out any signal you have.
The limitation of using the jitter is that it addresses this particular issue only. More generally, one might put other constraints on model hyperparameters, and it would be nice to be able to use those explicitly in the optimization. I typically use a wrapper of scipy's L-BFGS-B for fitting models, which explicitly supports bounds on the parameters. Ideally the bounds could be stored with the model rather than having to be specified externally.
What do you think about the following change to `Module`:

```python
self._bounds = OrderedDict()
```

Then you could do something like

```python
self.register_bounds("raw_noise", lower=self._inv_param_transform(torch.tensor(1e-4)).item())
```

in `GaussianLikelihood`. This wouldn't change anything in terms of the current functionality. It would, however, allow you to read out the bounds and pass them appropriately to an external optimizer. I have a draft implementation of this that seems to work well.
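A minimal, gpytorch-free sketch of what such a registry could look like (`register_bounds` and `named_bounds` are hypothetical names following the proposal above, not existing gpytorch API):

```python
from collections import OrderedDict

class Module:
    """Minimal stand-in for gpytorch.Module, for illustration only."""
    def __init__(self):
        self._bounds = OrderedDict()
        self._modules = OrderedDict()

    def register_bounds(self, name, lower=None, upper=None):
        # store (lower, upper); None means unbounded on that side
        self._bounds[name] = (lower, upper)

    def named_bounds(self, prefix=""):
        # walk submodules so dotted names like "likelihood.raw_noise" come out,
        # mirroring how named_parameters() builds full names
        for name, (lo, hi) in self._bounds.items():
            yield (name if not prefix else prefix + "." + name), lo, hi
        for mod_name, mod in self._modules.items():
            sub_prefix = mod_name if not prefix else prefix + "." + mod_name
            yield from mod.named_bounds(sub_prefix)

# usage: a likelihood registers a lower bound on its raw noise parameter
likelihood = Module()
likelihood.register_bounds("raw_noise", lower=-9.2)  # e.g. an inverse-transformed 1e-4
model = Module()
model._modules["likelihood"] = likelihood  # stand-in for submodule registration
print(list(model.named_bounds()))  # [('likelihood.raw_noise', -9.2, None)]
```

The dotted names can then be matched directly against `named_parameters()` output when handing bounds to an external optimizer.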
The one concern I have with the proposed interface is that, if I am reading this right, it essentially has no effect on anything unless the user explicitly leaves torch for optimization. This is problematic if the user expects the bounds to be enforced automatically.
The other concern I have for this being a general addition is GPU support, since I assume it is impossible to interface with scipy without adding at least some CPU <> GPU communication.
> it essentially has no effect on anything unless the user explicitly leaves torch for optimization
Not quite, it's more like it doesn't do anything unless the user explicitly makes use of it. That could be via scipy or other optimizers, but it could be via `torch.optim` as well. E.g. the user could use these bounds to programmatically clamp parameters within their torch optimization loop without having to specify the parameter names explicitly in the loop. We could make sure the docstring is explicit that this is not automatically used, and maybe point to an example of how it could be.
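The clamping idea can be sketched without torch: after each optimizer step, project any bounded parameter back into its interval, with no parameter names hard-coded in the loop (the bounds mapping and value dict below are illustrative stand-ins for the registered bounds and the model's parameters):

```python
def clamp(value, lower=None, upper=None):
    # project a value back into [lower, upper]; None means unbounded
    if lower is not None and value < lower:
        value = lower
    if upper is not None and value > upper:
        value = upper
    return value

# hypothetical bounds registry: dotted parameter name -> (lower, upper)
bounds = {"likelihood.raw_noise": (-9.2, None)}

# stand-in for the model's parameter values after an aggressive optimizer step
params = {"likelihood.raw_noise": -50.0, "covar_module.raw_lengthscale": 0.3}

# generic projection step: the loop never mentions specific parameter names
for name, value in params.items():
    if name in bounds:
        lo, hi = bounds[name]
        params[name] = clamp(value, lo, hi)

print(params["likelihood.raw_noise"])  # -9.2: projected back onto the lower bound
```

With torch, the same loop would wrap the in-place update in `torch.no_grad()` and use `tensor.clamp_`, but the bookkeeping is identical.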
> The other concern I have for this being a general addition is GPU support, since I assume it is impossible to interface with scipy without adding at least some CPU <> GPU communication.
This will be an issue only if you have to shove a bunch of data back and forth, i.e. if your model parameters are high-dimensional. If we're talking maybe a few dozen parameters in the case of an exact GP, this is pretty negligible compared to computing the posterior for a set of test points. Regarding the implementation: I have some optimizer utilities that live outside gpytorch that (i) automatically transform a model to a `scipy.optimize.minimize`-compatible description and (ii) update the model parameters during the optimization steps, making sure to retain the proper dtype and device.
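The transform in (i) essentially boils down to flattening named parameters into a single vector and inverting that mapping after each scipy step. A stdlib-only sketch (in practice the values would be tensors detached to NumPy on the way in, and written back with the original dtype and device on the way out):

```python
def flatten_params(params):
    # params: dict of name -> list of floats (stand-in for tensors)
    # returns a flat vector plus the bookkeeping needed to invert it
    names, sizes, flat = [], [], []
    for name, values in params.items():
        names.append(name)
        sizes.append(len(values))
        flat.extend(values)
    return flat, (names, sizes)

def unflatten_params(flat, spec):
    # inverse of flatten_params: slice the flat vector back out by name
    names, sizes = spec
    params, offset = {}, 0
    for name, size in zip(names, sizes):
        params[name] = flat[offset:offset + size]
        offset += size
    return params

params = {"raw_noise": [-9.2], "raw_lengthscale": [0.1, 0.2]}
flat, spec = flatten_params(params)
print(flat)  # [-9.2, 0.1, 0.2]
assert unflatten_params(flat, spec) == params
```

The registered bounds would be flattened with the same `spec`, so they line up index-for-index with the vector scipy's L-BFGS-B sees.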
Maybe the solution is to have `register_bounds` live somewhere outside `gpytorch.Module`? I'm not sure exactly what the right solution is, but I also would like to avoid bloating `gpytorch.Module` with `register_*` methods if possible.
Alternatively, is there some way that we can metaprogram a `register_*` thing onto modules, which could clean up all the things that we register (extra loss terms, priors, etc.)?
All in all, I wouldn't be opposed to this, if we do as @Balandat suggests and include an example w/ clamping. However, I do think that this might be a feature that belongs in PyTorch proper since it's not necessarily GP specific.
Yeah, ultimately we'll want some way of specifying constraints in pytorch that works easily with `torch.optim`, but that may be a little ways away if there are no algorithms that support it. It would be straightforward to modify most of the existing ones to be projected versions; maybe it's worthwhile writing that PR upstream.
For the time being, what one could do is have a simple `Bounds` structure outside of the module that specifies bounds (or even more general constraints). The simplest case would just be some dictionary mapping parameter names to bounds. Then this could be used appropriately in the optimization loop. The downside of this is that it requires a bunch of manual bookkeeping of the parameter names (with all the submodule dots...). I guess you could do something like write a function `add_bounds(bounds, module, parameter_name, lower=None, upper=None)` that takes in some external `Bounds` object `bounds` and the current module `module` (to automatically construct the full parameter name). We can have this live totally outside of gpytorch.
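A rough sketch of what that could look like, entirely outside gpytorch. Here bounds are recorded against the module object and the dotted names are resolved later against the root model's submodule tree; all names (`Bounds`, `add_bounds`, `resolve_names`) are hypothetical, and the `submodules` dict stands in for something like `named_modules()`:

```python
class Bounds:
    """Hypothetical external bounds registry; nothing here exists in gpytorch."""
    def __init__(self):
        self._by_module = {}  # id(module) -> {param_name: (lower, upper)}

def add_bounds(bounds, module, parameter_name, lower=None, upper=None):
    # record the bound against the owning module; the full dotted
    # parameter name is constructed later, once the root model is known
    bounds._by_module.setdefault(id(module), {})[parameter_name] = (lower, upper)

def resolve_names(bounds, submodules):
    # submodules: dict of dotted prefix -> module (stand-in for named_modules())
    resolved = {}
    for prefix, mod in submodules.items():
        for pname, bound in bounds._by_module.get(id(mod), {}).items():
            full = pname if not prefix else prefix + "." + pname
            resolved[full] = bound
    return resolved

class Dummy:  # stand-in for a gpytorch module
    pass

model, likelihood = Dummy(), Dummy()
b = Bounds()
add_bounds(b, likelihood, "raw_noise", lower=-9.2)
print(resolve_names(b, {"": model, "likelihood": likelihood}))
# {'likelihood.raw_noise': (-9.2, None)}
```

This keeps `gpytorch.Module` untouched, at the cost of the resolution step needing access to the root model.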
cc @sdaulton, @bletham, @bkarrer
Expanding on Geoff's idea to metaprogram all `register_*` and corresponding `named_*` methods, what if `gpytorch.Module` just implemented `register_parameter_metadata(parameter_name, meta_name, **kwargs)`? Then a call would look like:

```python
self.likelihood.register_parameter_metadata("raw_noise", "prior", ...)
```

`named_parameter_metadata` would require a `meta_name` as an argument instead of taking no args like e.g. `named_priors` does now. This is easily done without breaking anything, since the existing methods can just call the new ones, and we may not even want to deprecate the prior-specific ones since they are much more commonly used.
Edit: Come to think of it, it would be great if we could then just metaprogram `Module` so that `register_*` is an alias for this base method, so that any metadata could have the same interface as registering priors.
Hmm, yeah, that would work. The thing I am a little worried about is that the function signature for this would be really opaque, with a lot of magic happening, making things potentially quite hard to debug.
We want to have models behave reasonably when fit to "sufficiently nice" data. Right now, if you fit an exact GP with a `GaussianLikelihood` to such target data (e.g. a constant or a linear function), the optimization will error out with the following insightful error message:

This is because the most likely value for the `noise` parameter is zero (or close to it), which in both the `exp` and `softplus` parameterizations means the raw parameter approaches `-inf`, so the optimization runs into `NaN` values sooner or later.

Not quite sure what the best way to deal with this would be. Ideally we'd enforce a minimal noise level directly in the optimizer (if it supports constraints). For the pytorch optimizers, clamping would be an option.
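The divergence is easy to see concretely: the raw (unconstrained) parameter is the inverse transform of the noise, and it blows up as the noise goes to zero (stdlib-only illustration of the two parameterizations mentioned above):

```python
import math

def inv_exp(noise):
    # exp parameterization: noise = exp(raw), so raw = log(noise)
    return math.log(noise)

def inv_softplus(noise):
    # softplus parameterization: noise = log(1 + exp(raw)),
    # so raw = log(exp(noise) - 1)
    return math.log(math.expm1(noise))

for noise in (1e-2, 1e-4, 1e-8):
    print(noise, inv_exp(noise), inv_softplus(noise))
# under both transforms, the raw value heads to -inf as noise -> 0,
# and gradient steps in that regime eventually produce NaNs
```

A fixed jitter floor or an explicit lower bound on `raw_noise` keeps the optimizer out of this regime entirely.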