SheffieldML / GPy

Gaussian processes framework in python
BSD 3-Clause "New" or "Revised" License

Kernel variance vs. Gaussian noise variance #848

Open jmren168 opened 4 years ago

jmren168 commented 4 years ago

Hi,

In my case, when I optimize GPy.models.GPRegression with an RBF kernel, I get two different kinds of results:

  1. the kernel variance (~0.00001) is much smaller than the Gaussian noise variance (~1). What does this result mean?
  2. the kernel variance (~10) is much larger than the Gaussian noise variance (~0.009). What does this result mean?

Any help would be highly appreciated.

Best, JM

lawrennd commented 4 years ago

In (1) the model has found a minimum where the signal-to-noise ratio is very low (for a simple RBF kernel you can divide the RBF variance by the Gaussian noise variance to obtain this ratio).

In (2) the opposite has happened.

These can be local minima, so you need to be sensitive to initialisation (such as the kernel lengthscale) and perhaps try some different starting points.

This paper uses signal to noise ratios to find 'quiet genes' in gene expression. It might be helpful.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-180

jmren168 commented 4 years ago

Thanks for the reply. I've tried different starting points (via optimize_restarts) and got similar results.

BTW, how do I decide whether a GPRegression model is fitted well? My opinion is that the kernel variance should be larger than the Gaussian noise variance, and the Gaussian noise variance should be about 0.25~0.5.

Any comments are helpful. JM

lawrennd commented 4 years ago

If you have a look at the paper, the key is to look at the likelihood of the different fits. Often one likelihood will be far larger than the other.

lionfish0 commented 4 years ago

Do you have any priors you can bring to the problem?

On a scale from easiest solution to most principled solution:

a) Rather than use the random restarts in the optimize_restarts method, initialise the parameters roughly to the values you expect, so the optimiser is likely to find the correct (hopefully global) maximum likelihood.
b) Put constraints to keep a parameter in a known bound: use "constrain_bounded".
c) Slightly more principled: add a prior with "set_prior", then use normal ML (now MAP) estimation.
d) More principled still: integrate (sample) over the hyperparameters.

Just some thoughts. Mike


jmren168 commented 4 years ago

Thanks for the reply. I'm not sure how to bring priors to the problem.

More details about this real case:

  1. ~30 samples, say x1, x2, ..., x30, each of dimensionality 100.
  2. Only 3~5 values of x2 differ from x1; the same phenomenon appears when comparing x3 and x2. In addition, 30%~50% of these 100 values are the same across samples.

One idea we have is to use leave-one-out (LOO) to select possible kernel variances and lengthscales, and then use each of these LOO-selected kernel parameters to set constrain_bounded. But I'm not sure whether this is correct.

Any suggestions are appreciated. JM

zhenwendai commented 4 years ago

If you do not use ARD (which gives one lengthscale per input dimension), the model should not overfit, but it may treat everything as noise (kernel variance close to zero).

If you need ARD, point estimate with cross validation or MCMC could be a solution to the problem.

jmren168 commented 4 years ago

@lionfish0 Hi Mike,

Do you have any further references for (c) or (d)? Thanks in advance.

c) Slightly more principled is to add a prior. use "set_prior", and then use normal ML estimation. d) More principled still is to integrate (sample) over the hyperparameters.

Best, JM

jmren168 commented 4 years ago

I found a paper discussing the estimation of kernel hyperparameters via MLE and LOOCV.

F. Bachoc, Cross Validation and Maximum Likelihood estimations of hyper-parameters of Gaussian processes with model misspecification. Computational Statistics & Data Analysis 66 (2013): 55-69.

jmren168 commented 4 years ago

Hi,

After using ARD, I noticed two kinds of lengthscale results:

  1. a very large lengthscale, reaching the upper bound of the constrain_bounded setting, say 1000;
  2. a very small lengthscale, say 0.00000001.

My interpretation of these two cases:

  1. if the values of a dimension do not affect y at all, its optimized lengthscale becomes large;
  2. the dimension with the smallest lengthscale should affect y the most.

Please correct me if I'm wrong. Many thanks.