GPflow / GPflowOpt

Bayesian Optimization using GPflow
Apache License 2.0

Cholesky failures due to inappropriate initial hyperparameters #7

Closed javdrher closed 7 years ago

javdrher commented 7 years ago

As mentioned in #4, tests often failed (mostly on Python 2.7) due to Cholesky decomposition errors. At first I thought this was mostly caused by updating the data and calling optimize() again, but resetting the hyperparameters wasn't working all the time. Increasing the likelihood variance sometimes helps slightly but isn't very robust either.

Right now the tests specify lengthscales for the initial model and apply a hyperprior on the kernel variance. Each BO iteration, the hyperparameters supplied with the initial model are used as a starting point. In addition, restarts are applied by randomizing the Params. This approach made things a lot more stable but isn't perfect yet. Especially in longer runs of BO, reverting to the supplied lengthscales each time ultimately causes crashes.
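In pseudocode, the restart strategy amounts to roughly the following (a simplified sketch, not the actual implementation; it assumes the model's get_free_state/set_state and compute_log_likelihood methods):

import numpy as np

def optimize_with_restarts(model, initial_state, n_restarts=5):
    # Simplified sketch: the first attempt starts from the supplied hyperparameters,
    # later attempts from randomized Params; keep the best fit found.
    best_state, best_ll = None, -np.inf
    for i in range(n_restarts):
        model.set_state(initial_state if i == 0 else np.random.randn(initial_state.size))
        try:
            model.optimize()
            ll = model.compute_log_likelihood()
            if ll > best_ll:
                best_state, best_ll = model.get_free_state(), ll
        except Exception:  # e.g. a failed Cholesky inside TensorFlow
            continue
    if best_state is not None:
        model.set_state(best_state)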

Some things we may consider:

I'd love to hear thoughts on how to improve this.

1Reinier commented 7 years ago

This is not necessarily my area of expertise, but I assume Cholesky is failing because the covariance matrix is not positive definite. Is that correct? If so, how could this be the case?

I understand Cholesky is preferable for inversion problems due to performance. It would be nice if there were a more stable fallback option, such as QR, in GPflow. But that's out of our hands, I suppose.

jameshensman commented 7 years ago

My experience is that normalization of both inputs and outputs is helpful. I don't agree that we lose any interpretability of the lengthscales: we can always just rescale them. My suggestion is to normalize by default and allow experts to turn it off.
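For example, if the inputs were standardized per dimension, mapping the learned lengthscales back to the original units is a one-liner (a sketch, assuming a fitted model and the original inputs X):

# lengthscales learned on X_norm = (X - X.mean(0)) / X.std(0), mapped back to the original units
lengthscales_original = model.kern.lengthscales.value * X.std(axis=0)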

I'm curious to know where the numerical issues are coming from. Are they happening when there are very few points? Or when we have many close-together points? The first suggests that the optimizer is reaching a poor setting of the lengthscales, fixed by sensible prior regularization (a sensible default is possible if the inputs are normalized). The second is quite common in BO and the simplest solution is to increase the noise: perhaps the noise variance should not be allowed to be smaller than, say 1e-3 (note that the scale of Y should be 1 if the data are normalized).
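A minimal sketch of the noise floor on normalized data (hypothetical code; it assumes a GPflow version where transforms.Log1pe accepts a lower shift and where a Param's transform can be replaced before compiling, so adapt to the version in use):

import numpy as np
import gpflow

X = np.random.rand(25, 2)                      # inputs normalized to [0, 1]
Y = np.sin(10 * X[:, :1]) + 0.05 * np.random.randn(25, 1)
Y = (Y - Y.mean()) / Y.std()                   # outputs normalized, so the scale of Y is ~1

m = gpflow.gpr.GPR(X, Y, kern=gpflow.kernels.RBF(2, ARD=True))
m.likelihood.variance = 1e-2                                            # start away from zero
m.likelihood.variance.transform = gpflow.transforms.Log1pe(lower=1e-3)  # noise floored at 1e-3
m.optimize()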

javdrher commented 7 years ago

I was indeed thinking of auto-normalization; in Python, a wrapping class which automatically takes care of the scalings should be straightforward. However, I would like to keep the model as much of a "black box" as possible, so it remains possible to use other kernels and plug in different models without writing wrappers for all of them. That makes automatic rescaling of hyperparameters a bit tricky, as under this setting it isn't clear which hyperparameters are being optimized.

For the tests, the instability mostly occurs on Python 2.7. The typical scenario prior to a crash is huge lengthscales combined with a large kernel variance. With 3.5 it seems to be a lot more stable. It occurs with grid-based designs of size 16 or 25. I'm thinking the grids might actually make the optimization of the lengthscales a bit binary?

icouckuy commented 7 years ago

As far as I know it occurs at the start of the BO, when little data is available. In that case the signal variance (and lengthscales) explodes while the noise variance stays at 1, causing instability.

Bounding the optimization is indeed one option. Alternatively (or rather, in addition), I was also wondering if it is possible to catch this in GPflow and return gracefully, with the model state set to the last found hyperparameters that still worked (if any). In my opinion, the problem will then fix itself once more data becomes available (optionally helped along by MCMC sampling of the hyperparameters too).
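Something along these lines, for instance (a rough sketch of the idea, not existing GPflow/GPflowOpt behaviour; it assumes the failure surfaces as a TensorFlow InvalidArgumentError and that the model exposes get_free_state/set_state):

import tensorflow as tf

def safe_optimize(model, last_good_state=None):
    # Try to optimize; on a Cholesky failure, fall back to the last hyperparameters that worked.
    try:
        model.optimize()
        return model.get_free_state()          # becomes the new last-good state
    except tf.errors.InvalidArgumentError:     # Cholesky decomposition failed
        if last_good_state is not None:
            model.set_state(last_good_state)
        return last_good_state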

Normalization is a given, but I was wondering whether this should be the responsibility of GPflowOpt or the user? I'm inclined to leave it to the user to decide; it would reduce the complexity. For instance, predictive entropy relies directly on the hyperparameters, each of which would then need to be (un)scaled correctly.

javdrher commented 7 years ago

For the normalization, I was thinking about the following wrapper for a model in the Acquisition class (not complete, and it has some issues):

import numpy as np
from gpflow.model import Model
from gpflow.param import DataHolder, AutoFlow
from gpflow import settings

float_type = settings.dtypes.float_type  # imports assume the gpflow layout used at the time


class Normalizer(Model):
    """Wraps a GPflow model: the wrapped model stores normalized X/Y, the wrapper exposes original units."""

    def __init__(self, model, normalize_input=True, normalize_output=True):
        assert (model is not None)
        self._wrapped = model
        super(Normalizer, self).__init__()

        self._input = normalize_input
        self.input_transl = DataHolder(np.zeros((model.X.shape[1],)))
        self.input_scale = DataHolder(np.ones((model.X.shape[1],)))

        self._output = normalize_output
        self.output_transl = DataHolder(np.zeros((model.Y.shape[1],)))
        self.output_scale = DataHolder(np.ones((model.Y.shape[1],)))

        # Re-assign the data through the properties below so it gets normalized.
        self.X = model.X.value
        self.Y = model.Y.value

    def __getattr__(self, item):
        # Delegate everything not found on the wrapper to the wrapped model.
        return self.__getattribute__('_wrapped').__getattribute__(item)

    def __setattr__(self, key, value):
        if key == '_wrapped':
            # Keep a plain reference to the wrapped model and register this wrapper as its parent.
            object.__setattr__(self, key, value)
            value.__setattr__('_parent', self)
            return

        super(Normalizer, self).__setattr__(key, value)

    # Overwrites: X and Y in the original (unnormalized) units

    @property
    def X(self):
        return DataHolder(self._wrapped.X.value * self.input_scale.value + self.input_transl.value)

    @property
    def Y(self):
        return DataHolder(self._wrapped.Y.value * self.output_scale.value + self.output_transl.value)

    @X.setter
    def X(self, value):
        if self._input:
            self.input_transl = value.mean(axis=0)
            self.input_scale = value.std(axis=0)
        self._wrapped.X = (value - self.input_transl.value) / self.input_scale.value

    @Y.setter
    def Y(self, value):
        if self._output:
            self.output_transl = value.mean(axis=0)
            self.output_scale = value.std(axis=0)
        self._wrapped.Y = (value - self.output_transl.value) / self.output_scale.value

    @AutoFlow((float_type, [None, None]))
    def predict_f(self, X):
        # Predict on normalized inputs, then map mean and variance back to the original output scale.
        pred_f, pred_var = self._wrapped.build_predict((X - self.input_transl) / self.input_scale)
        return pred_f * self.output_scale + self.output_transl, pred_var * (self.output_scale ** 2)
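Intended usage would be along these lines (untested, given the issues mentioned above; X, Y and Xnew are arrays in the original units):

m = gpflow.gpr.GPR(X, Y, kern=gpflow.kernels.RBF(X.shape[1], ARD=True))
n = Normalizer(m)                    # the wrapped model now holds normalized data
mean, var = n.predict_f(Xnew)        # predictions mapped back to the original output scale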

1Reinier commented 7 years ago

That code looks wonderful. I think normalization is helpful for sure. The interpretation of a prior's parameters may change if the data scaling changes, which needs to be made clear to the user, I think.

As to the Cholesky issue: as long as the covariance matrix is positive definite and not too rank-deficient, the decomposition should work. The following things could help:

Since the third option is ultimately in the hands of the user (if they don't use the default priors), options 1 and 2 are probably the best way forward, which your code seems to effectively do.

jameshensman commented 7 years ago

One simple strategy for the lengthscales would be to set them to (e.g.) 0.8 times the length of the domain until we have sufficient points to estimate them more reasonably than that. For example, if the number of data points is smaller than the input dimension, it's going to be impossible to estimate the lengthscales anyway, so I suggest sticking with the above estimate.
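A sketch of that default (X, model and the domain bounds lower/upper are placeholders here):

import numpy as np

lower, upper = np.zeros(2), np.ones(2)        # per-dimension domain bounds
if X.shape[0] < X.shape[1]:                   # fewer points than input dimensions
    model.kern.lengthscales = 0.8 * (upper - lower)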

jameshensman commented 7 years ago

Two things with normalization:

1) I suggest that inputs are normalized to [0, 1], not based on the data but based on the domain that the user specifies.

2) be careful dividing by the std(), which is zero if n==1.
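A minimal sketch of both points (X, Y and the domain bounds are placeholders):

import numpy as np

lower, upper = np.array([-2.0, 0.0]), np.array([2.0, 10.0])   # user-specified domain bounds
X_norm = (X - lower) / (upper - lower)                        # inputs to [0, 1], based on the domain

Y_std = Y.std(axis=0)
Y_std[Y_std == 0] = 1.0                                       # guard: std is zero when n == 1
Y_norm = (Y - Y.mean(axis=0)) / Y_std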

javdrher commented 7 years ago

So, in summary we could do the following:

@jameshensman, regarding setting the initial default lengthscales to 0.8 * the size of the domain: what would be your suggestion for finding the lengthscale Params in the model? Should we traverse the tree below model.kern and look for lengthscale Params?
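E.g. something like this (a sketch, assuming the gpflow.param classes):

import gpflow

def find_lengthscales(parameterized):
    # Walk the tree below e.g. model.kern and collect every Param named 'lengthscales'
    # (also inside composite kernels such as Sum or Product).
    found = []
    for name, child in vars(parameterized).items():
        if name == 'lengthscales' and isinstance(child, gpflow.param.Param):
            found.append(child)
        elif isinstance(child, gpflow.param.Parameterized):
            found.extend(find_lengthscales(child))
    return found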

I was testing a bit further last night and I think that, for the tests, it might be sufficient to get rid of the grid designs used now and use something better. With just random points it was much harder to reproduce the instability (although those aren't ideal for automated tests either).

icouckuy commented 7 years ago

Provide an automated normalizer (which can be switched off). It scales output to N(0,1) and input to [0, 1].

Agree, this makes sense and still makes sure the scale of the input and output is similar.

Change _optimize_all in acquisition.py to something like this:

javdrher commented 7 years ago

maybe rename optimize_all to optimize_models (I have a thing for proper names) :)

we would need a separate callback for each model in the list?

Makes sense. Or pass the list, or the index. As this callback is intended to provide control for specific situations, any way to find out what the models are would do, I guess.

How would the user know which normalization is used? Is this passed along with the callback?

I'd remove all scaling wrappers so in the callback the user has the GPflow model, with the normalized data in it. Based on that, priors, fixes, transforms and values can be adjusted.

javdrher commented 7 years ago

The problem disappeared completely after switching from the grids to Latin hypercubes. I removed all hyperparameter initialization, scalings and the hyperprior, and it works out of the box.

Nevertheless, supporting the scalings still seems useful.
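For reference, the initial design now comes from something like the following (assuming the gpflowopt.domain and gpflowopt.design modules):

from gpflowopt.domain import ContinuousParameter
from gpflowopt.design import LatinHyperCube

domain = ContinuousParameter('x1', 0, 1) + ContinuousParameter('x2', 0, 1)
X_init = LatinHyperCube(16, domain).generate()    # 16 points instead of a 4x4 grid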

javdrher commented 7 years ago

Steps to finalize the solutions to this issue:

javdrher commented 7 years ago

All steps were implemented, closing this issue for now.