darthdeus opened this issue 5 years ago
I'm guessing that the problem is because several hyperparameter configurations lead to very similar marginal likelihoods. What happens is the ML estimation finds different solutions when you run it. It's not the GP that's fragile but the use of a single (non-Bayesian/probabilistic) setting for these hyperparameters. One can imagine either the marginal likelihood landscape is very flat (so it doesn't care where it optimises) or very hilly with lots of local maxima/minima*.
In order of correctness: (1) Something I've meant to do for years is get the CCD stuff sorted out. I think clearly there are enough people who need it that I should probably get that done soon! To summarise: this effectively finds the posterior given different hyperparameter configurations and then weighted-averages these (weighted by the associated marginal likelihoods); the upshot is that uncertainty over the variance in this case would appear in the uncertainty in the posterior (over models). This is the right thing to do. Sorry I've not done it yet!
(2) A quick solution if you've a prior belief on what the hyperparameters should be is to use a prior over the hyperparameters, that will shift where the ML solution ends up. I can't remember exactly how to do this, something like m.kern.variance.prior = GPy.priors.Gamma(1,0.1)
(3) You can put severely strict "priors" that restrict the range a parameter can take, e.g. k.lengthscale.constrain_bounded(1,10).
(4) It might also be worth looking at optimize_restarts which tries several times from different starting points, to ensure the ML solution isn't a poor local minimum.
(5) Or just use optimize but set your hyperparameters beforehand to be near where you think the right answer is (see the combined sketch after this list).
Hope some of these help. Sorry again (1) isn't implemented yet. Mike.
*I think it optimises the negative-log-likelihood so technically it's minima we're looking for, I think...bit confusing.
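For reference, a minimal sketch of options (2)-(5) in GPy, assuming training arrays X and Y and an RBF kernel (the particular prior, bounds and starting value below are purely illustrative, not recommendations):

import GPy

k = GPy.kern.RBF(input_dim=1)
# (2) a prior over a hyperparameter shifts where the ML/MAP solution ends up
k.variance.set_prior(GPy.priors.Gamma(1.0, 0.1))
# (3) hard bounds restrict the range a parameter can take
k.lengthscale.constrain_bounded(1.0, 10.0)
m = GPy.models.GPRegression(X, Y, kernel=k)
# (5) start the optimiser near where you think the right answer is
m.rbf.lengthscale = 2.0
# (4) several restarts guard against ending up in a poor local optimum
m.optimize_restarts(num_restarts=10)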
Thank you so much for such a quick and detailed response. You really saved me a lot of time :)
I'm guessing that the problem is because several hyperparameter configurations lead to very similar marginal likelihoods. What happens is the ML estimation finds different solutions when you run it. It's not the GP that's fragile but the use of a single (non-Bayesian/probabilistic) setting for these hyperparameters. One can imagine either the marginal likelihood landscape is very flat (so it doesn't care where it optimises) or very hilly with lots of local maxima/minima*.
This makes a lot of sense. I've tried running with many restarts and it always finds a different set of parameters.
I wonder what would be a good way of exploring the likelihood surface visually. I tried plotting lengthscale and variance on a grid with the likelihood as Z, and it is extremely flat (image https://i.imgur.com/io1JMHy.png).
Just out of curiosity, do you think plotting it this way on a grid makes sense? In this case when the likelihood is smooth it's probably ok. But I've been playing with GPs for a while and sometimes it ends up being weird in exactly the places where I don't plot.
I'm not sure if there's some more general approach, like MCMC sampling the parameters and doing a scatter plot of the samples. But then they'd need priors, and given how huge the range is in this case (0; 10^7), I'm not sure if a uniform prior would make sense, since it seems to keep going lower and lower (but also flatter).
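For what it's worth, a minimal sketch of the kind of grid evaluation described above, assuming an existing GPRegression model m with an RBF kernel (the ranges and resolution are illustrative, and this is not necessarily how the linked plot was made):

import numpy as np
import matplotlib.pyplot as plt

lengthscales = np.logspace(-2, 2, 50)
variances = np.logspace(-2, 7, 50)
Z = np.zeros((len(lengthscales), len(variances)))
for i, l in enumerate(lengthscales):
    for j, v in enumerate(variances):
        # fix the hyperparameters and read off the log marginal likelihood
        m.rbf.lengthscale = l
        m.rbf.variance = v
        Z[i, j] = m.log_likelihood()

plt.contourf(np.log10(variances), np.log10(lengthscales), Z, levels=50)
plt.xlabel('log10 variance')
plt.ylabel('log10 lengthscale')
plt.colorbar()
plt.show()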
(2) A quick solution if you've a prior belief on what the hyperparameters should be is to use a prior over the hyperparameters, that will shift where the ML solution ends up. I can't remember exactly how to do this, something like m.kern.variance.prior = GPy.priors.Gamma(1,0.1)
I wasn't aware you could set priors in GPy. This is great!
For anyone referencing this later, the syntax is:
m.kern.variance.set_prior(GPy.priors.Gamma(1, 0.1))
(4) It might also be worth looking at optimize_restarts which tries several times from different starting points, to ensure the ML solution isn't a poor local minimum.
Yeah as mentioned above, I've tried this, but given how flat it is (as seen in the image), I guess the optimizer just converges to an arbitrary point on the surface.
Hope some of these help. Sorry again (1) isn't implemented yet. Mike.
Thank you again for such a detailed answer. You really helped me a lot :)
@lionfish0 I wonder, wouldn't fixing the random seed lead to the exact same solution every time? I'm having issues reproducing my results because the predicted values of my parameters change slightly (in the last few decimal places) pretty much every time I run my code, despite fixing numpy's random seed! How can I resolve this issue?
Note that I'm constraining my hyperparameters as follows:
import GPy

Matern32 = GPy.kern.Matern32(input_dim=4, ARD=True)
Matern32.lengthscale.constrain_bounded(1e-8, 12.0, warning=False)
Matern32.variance.constrain_bounded(1e-1, 100.0, warning=False)
m = GPy.models.GPRegression(x, y, normalizer=True, noise_var=1e-4, kernel=Matern32)
@lionfish0 For example, one time I get 1.0034057019672782 and another time I get 1.0034057019688161...
@darthdeus For float64, did you simply convert your x and y to np.float64?
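For reference, if that is what's being asked, the conversion would typically look something like this (x and y here are the training arrays):

import numpy as np

x = np.asarray(x, dtype=np.float64)
y = np.asarray(y, dtype=np.float64)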
This is so helpful. Thanks :)
@darthdeus Hi Jakub, could you share how you made the likelihood surface (image https://i.imgur.com/io1JMHy.png)? I googled around to find the function for this without success. Thanks.
I've run into a somewhat weird case where optimizing the kernel parameters leads to very high values of variance with both RBF and Matern52 kernels. Regardless of whether I normalize the inputs, I get about the same likelihood (NLL) and very similar kernel parameters for both kernels, and adding more data points doesn't change this.
Here's a smaller test case with just five data points
Changing the last X item from 0.999 to 0.999999 increases the variance by an order of magnitude
Notice how the rbf.variance is on the order of tens to hundreds of thousands. Very minor changes in the data cause this to change by large amounts. For example, here's a changed version of the first input where the variance suddenly is 600k.
Small change drops variance from 63000 to 0.65
I created these 5 points by trying to remove as many as I could from the following dataset while keeping the variance extremely high. Here's another modified example, where the second and third values in X are changed from 0.428 and 0.427 to 0.40 and 0.44, and suddenly the resulting variance is just 0.65 instead of the previous 63415.
Same problem with 10 data points
Here is yet another example, but with 10 points instead of 5. I actually started with this dataset and tried to simplify it and remove points as much as I could while keeping the weird behavior. In this case, the variance goes up to 3.7 million.
Overflows, underflows, restarts and invalid values
As I've been trying to figure this out for a few hours, I'm also sometimes getting numerical errors, specifically stuff like this:
I don't have all the errors, but I've seen both overflows and underflows occur sometimes. Usually running optimize_restarts() with a larger number of restarts causes a few of them to fail, though the optimizer still converges to roughly the same values (even 1e7). Note that this happens even if I use np.float64.
How do I diagnose this? What is actually going on?
Prior to running into this issue I had actually been working on my own implementation of GPs, and tried to use GPy as a solution to my own numerical issues, but it seems that either I'm doing something fundamentally wrong, or GPs are just a lot more fragile than I expected.
I'm not sure if there is a more stable way to optimize the kernel. What I'm ultimately after is Bayesian optimization, so I don't really need to fit the kernel perfectly in case the data is a bit weird (which I don't see it being in this case), but I'd like to avoid pathological cases where the optimization blows up.
Is there some way to make the optimization more robust, or "safer", in the sense of putting a stronger prior towards smoother GPs or noisy inputs? Or is there something that I'm just fundamentally missing?
edit: I realized something after waking up today. The data in question is actually linear (sampled from a nearly noiseless linear function). I'm not sure whether that is the cause of the numerical instability, but it's definitely worth mentioning.
One thought I had is to constrain the Gaussian_noise to a higher value than 0, but I'm not sure if that is a good solution, or if there is something better to try.
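For reference, a minimal sketch of that idea in GPy, assuming training arrays X and Y (the noise floor and bounds below are purely illustrative, not a recommendation):

import GPy

k = GPy.kern.RBF(input_dim=1)
m = GPy.models.GPRegression(X, Y, kernel=k)
# keep the noise variance away from zero so the signal variance can't blow up to compensate
m.Gaussian_noise.variance.constrain_bounded(1e-4, 1.0, warning=False)
m.optimize_restarts(num_restarts=10)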