darthdeus opened this issue 5 years ago
I'm guessing that the problem is because several hyperparameter configurations lead to very similar marginal likelihoods. What happens is the ML estimation finds different solutions when you run it. It's not the GP that's fragile but the use of a single (non-Bayesian/probabilistic) setting for these hyperparameters. One can imagine either the marginal likelihood landscape is very flat (so it doesn't care where it optimises) or very hilly with lots of local maxima/minima*.
In order of correctness: (1) Something I've meant to do for years is get the CCD stuff sorted out. I think clearly there are enough people who need it that I should probably get that done soon! To summarise: this effectively finds the posterior given different hyperparameter configurations and then weighted-averages these (weighted by the associated marginal likelihoods); the upshot is that uncertainty over the variance in this case would appear in the uncertainty in the posterior (over models). This is the right thing to do. Sorry I've not done it yet!
(2) A quick solution if you've a prior belief on what the hyperparameters should be is to use a prior over the hyperparameters, that will shift where the ML solution ends up. I can't remember exactly how to do this, something like m.kern.variance.prior = GPy.priors.Gamma(1,0.1)
(3) You can put severely strict "priors" that restrict the range a parameter can take, e.g. k.lengthscale.constrain_bounded(1,10).
(4) It might also be worth looking at optimize_restarts which tries several times from different starting points, to ensure the ML solution isn't a poor local minimum.
(5) Or just use optimize but set your hyperparameters beforehand to be near where you think the right answer is (see the combined sketch after this list).
Hope some of these help. Sorry again (1) isn't implemented yet. Mike.
*I think it optimises the negative-log-likelihood so technically it's minima we're looking for, I think...bit confusing.
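For reference, a minimal sketch of options (2)-(5) in GPy, assuming training arrays X and Y and an RBF kernel (the particular prior, bounds and starting value below are purely illustrative, not recommendations):

import GPy

k = GPy.kern.RBF(input_dim=1)
# (2) a prior over a hyperparameter shifts where the ML/MAP solution ends up
k.variance.set_prior(GPy.priors.Gamma(1.0, 0.1))
# (3) hard bounds restrict the range a parameter can take
k.lengthscale.constrain_bounded(1.0, 10.0)
m = GPy.models.GPRegression(X, Y, kernel=k)
# (5) start the optimiser near where you think the right answer is
m.rbf.lengthscale = 2.0
# (4) several restarts guard against ending up in a poor local optimum
m.optimize_restarts(num_restarts=10)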
Thank you so much for such a quick and detailed response. You really saved me a lot of time :)
I'm guessing that the problem is because several hyperparameter configurations lead to very similar marginal likelihoods. What happens is the ML estimation finds different solutions when you run it. It's not the GP that's fragile but the use of a single (non-Bayesian/probabilistic) setting for these hyperparameters. One can imagine either the marginal likelihood landscape is very flat (so it doesn't care where it optimises) or very hilly with lots of local maxima/minima*.
This makes a lot of sense. I've tried running with many restarts and it always finds a different set of parameters.
I wonder what would be a good way of exploring the likelihood surface visually. I tried plotting lengthscale and variance on a grid with the likelihood as Z, and it is extremely flat (image https://i.imgur.com/io1JMHy.png).
Just out of curiosity, do you think plotting it this way on a grid makes sense? In this case when the likelihood is smooth it's probably ok. But I've been playing with GPs for a while and sometimes it ends up being weird in exactly the places where I don't plot.
I'm not sure if there's some more general approach, like MCMC sampling the parameters and doing a scatter plot of the samples. But then they'd need priors, and given how huge the range is in this case (0; 10^7), I'm not sure if a uniform prior would make sense, since it seems to keep going lower and lower (but also flatter).
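For what it's worth, a minimal sketch of the kind of grid evaluation described above, assuming an existing GPRegression model m with an RBF kernel (the ranges and resolution are illustrative, and this is not necessarily how the linked plot was made):

import numpy as np
import matplotlib.pyplot as plt

lengthscales = np.logspace(-2, 2, 50)
variances = np.logspace(-2, 7, 50)
Z = np.zeros((len(lengthscales), len(variances)))
for i, l in enumerate(lengthscales):
    for j, v in enumerate(variances):
        # fix the hyperparameters and read off the log marginal likelihood
        m.rbf.lengthscale = l
        m.rbf.variance = v
        Z[i, j] = m.log_likelihood()

plt.contourf(np.log10(variances), np.log10(lengthscales), Z, levels=50)
plt.xlabel('log10 variance')
plt.ylabel('log10 lengthscale')
plt.colorbar()
plt.show()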
(2) A quick solution if you've a prior belief on what the hyperparameters should be is to use a prior over the hyperparameters, that will shift where the ML solution ends up. I can't remember exactly how to do this, something like m.kern.variance.prior = GPy.priors.Gamma(1,0.1)
I wasn't aware you could set priors in GPy. This is great!
For anyone referencing this later, the syntax is:
m.kern.variance.set_prior(GPy.priors.Gamma(1, 0.1))
(4) It might also be worth looking at optimize_restarts which tries several times from different starting points, to ensure the ML solution isn't a poor local minimum.
Yeah as mentioned above, I've tried this, but given how flat it is (as seen in the image), I guess the optimizer just converges to an arbitrary point on the surface.
Hope some of these help. Sorry again (1) isn't implemented yet. Mike.
Thank you again for such a detailed answer. You really helped me a lot :)
@lionfish0 I wonder, wouldn't fixing the random seed lead to the exact same solution every time? I'm having issues reproducing my results because the predicted values of my parameters change slightly (in the last few decimal places) pretty much every time I run my code, despite fixing numpy's random seed! How can I resolve this issue?
Note that I'm constraining my hyperparameters as follows:
import GPy

Matern32 = GPy.kern.Matern32(input_dim=4, ARD=True)
Matern32.lengthscale.constrain_bounded(1e-8, 12.0, warning=False)
Matern32.variance.constrain_bounded(1e-1, 100.0, warning=False)
m = GPy.models.GPRegression(x, y, normalizer=True, noise_var=1e-4, kernel=Matern32)
@lionfish0 For example, one time I get 1.0034057019672782 and another time I get 1.0034057019688161...
@darthdeus For float64, did you simply convert your x and y to np.float64?
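For reference, if that is what's being asked, the conversion would typically look something like this (x and y here are the training arrays):

import numpy as np

x = np.asarray(x, dtype=np.float64)
y = np.asarray(y, dtype=np.float64)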
This is so helpful. Thanks :)
@darthdeus Hi Jakub, could you share how you made the likelihood surface (image https://i.imgur.com/io1JMHy.png)? I googled around to find the function for this without success. Thanks.
I've run into a somewhat weird case where optimizing the kernel parameters leads to very high values of variance with both RBF and Matern52 kernels. Regardless of whether I normalize the inputs, I get about the same likelihood (NLL) and very similar kernel parameters for both kernels, and adding more data points doesn't change this.
Here's a smaller test case with just five data points
Changing the last X item from 0.999 to 0.999999 increases the variance by an order of magnitude
Notice how the rbf.variance is on the order of tens to hundreds of thousands. Very minor changes in the data cause this to change by large amounts. For example, here's a changed version of the first input where the variance suddenly is 600k.
Small change drops variance from 63000 to 0.65
I created these 5 points by trying to remove as many as I could from the following dataset while keeping the variance extremely high. Here's another modified example, where the second and third values in X are changed from 0.428 and 0.427 to 0.40 and 0.44, and suddenly the resulting variance is just 0.65 instead of the previous 63415.
Same problem with 10 data points
Here is yet another example, but with 10 points instead of 5. I actually started with this dataset and tried to simplify it and remove points as much as I could while keeping the weird behavior. In this case, the variance goes up to 3.7 million.
Overflows, underflows, restarts and invalid values
As I've been trying to figure this out for a few hours, I'm also sometimes getting numerical errors, specifically stuff like this:
I don't have all the errors, but I've seen both overflows and underflows occur sometimes. Usually running optimize_restarts() with a larger number of restarts causes a few of them to fail, though the optimizer still converges to roughly the same values (even 1e7). Note that this happens even if I use np.float64.
How do I diagnose this? What is actually going on?
Prior to running into this issue I had actually been working on my own implementation of GPs, and tried to use GPy as a solution to my own numerical issues, but it seems that either I'm doing something fundamentally wrong, or GPs are just a lot more fragile than I expected.
I'm not sure if there is a more stable way to optimize the kernel. What I'm ultimately after is Bayesian optimization, so I don't really need to fit the kernel perfectly in case the data is a bit weird (which I don't see it being in this case), but I'd like to avoid pathological cases where the optimization blows up.
Is there some way to make the optimization more robust, or "safer", in the sense of putting a stronger prior towards smoother GPs or noisy inputs? Or is there something that I'm just fundamentally missing?
edit: I realized something after waking up today. The data in question is actually linear (sampled from a nearly noiseless linear function). I'm not sure whether that is the cause of the numerical instability, but it's definitely worth mentioning.
One thought I had is to constrain the Gaussian_noise to a higher value than 0, but I'm not sure if that is a good solution, or if there is something better to try.
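For reference, a minimal sketch of that idea in GPy, assuming training arrays X and Y (the noise floor and bounds below are purely illustrative, not a recommendation):

import GPy

k = GPy.kern.RBF(input_dim=1)
m = GPy.models.GPRegression(X, Y, kernel=k)
# keep the noise variance away from zero so the signal variance can't blow up to compensate
m.Gaussian_noise.variance.constrain_bounded(1e-4, 1.0, warning=False)
m.optimize_restarts(num_restarts=10)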