Estimating the noise level hyperparameter \sigma_n

KukumavMozolo commented 8 years ago

As far as I know, there is no implemented method that estimates the autocorrelated noise (noise_variance) from points_sampled via maximum_likelihood or leave_one_out. I wonder why that is, and whether you are planning to implement it?

KukumavMozolo commented 8 years ago

OK, I found part of the answer in another thread. As suntzu86 said:

For each data point, you'd want to provide MOE with the mean & variance. For us, we were observing user clicks, so we computed mean/var as in a binomial distribution. I'm not sure what you're measuring, but you'd want either that or mean/var from beta.

P.S. I don't have the reference handy but there are more scientific ways of picking this value (based on max likelihood). But implementing these isn't trivial b/c it introduces a dependency on the hyperparameters and you'd need to optimize everything together. Plus, these estimates seem to do pretty badly early on when you don't have much data. So I haven't done it here.
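To make the binomial suggestion in the quote concrete, here is a minimal sketch of computing a per-point mean and variance from click counts. The function name and the example counts are hypothetical, and this is not MOE code; it only shows the standard binomial formulas for the estimated click rate.

```python
def ctr_mean_and_variance(clicks, impressions):
    """Return the observed CTR and the variance of that estimate (binomial model)."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    p_hat = clicks / impressions
    # Variance of the estimated proportion (not of a single click):
    # Var[p_hat] = p (1 - p) / n, with p replaced by its estimate.
    variance = p_hat * (1.0 - p_hat) / impressions
    return p_hat, variance


# Hypothetical example: 30 clicks out of 1000 impressions.
mean, var = ctr_mean_and_variance(clicks=30, impressions=1000)
print(mean, var)
```

The resulting pair is what one would pass along as the observed value and its noise variance for that sampled point. Note that this plug-in variance is exactly 0 when there are no clicks at all, which is one reason the quote also mentions the beta distribution.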

KukumavMozolo commented 8 years ago

So if I want to model the CTR, what you are suggesting is to use a beta distribution. But we are pretending that the CTR is normally distributed. I could imagine that the right tail of the beta is much longer than that of a normal distribution. Don't we underestimate the variance of the normal in that case? Another question: how would you update sigma_n with the new information we acquire while optimizing the objective? In other words, how do we avoid the GP explaining all parameter-dependent variation as noise?

suntzu86 commented 8 years ago

Yeah if there's a lot of demand for automatic noise estimation, I could add that in as an option... don't hold your breath though :) But from our experience, cases we ran across had either measured or estimated noise or were known to be noise-free. It seemed better to use these facts about the black box being optimized rather than select an arbitrary-ish (and universal... same noise at every point) noise value based on likelihood.
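For readers wondering what likelihood-based noise selection would involve, here is a rough sketch. It is not MOE's implementation: the squared-exponential kernel, the synthetic data, the fixed signal variance, and the grid search are all assumptions chosen for illustration. It shows why the noise variance cannot be picked in isolation: the log marginal likelihood depends on the noise and the length scale together, and the result is a single noise value shared by every point.

```python
import numpy as np


def se_kernel(x1, x2, signal_var, length_scale):
    """Squared-exponential covariance between two 1-D arrays of points."""
    diff = x1[:, None] - x2[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)


def log_marginal_likelihood(x, y, signal_var, length_scale, noise_var):
    """GP log marginal likelihood with one noise variance shared by all points."""
    n = x.shape[0]
    K = se_kernel(x, x, signal_var, length_scale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -0.5 y^T K^-1 y - 0.5 log|K| - (n/2) log(2 pi)
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)


# Synthetic data (assumption): a sine wave observed with noise of std 0.2.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Joint grid search: the noise and the length scale are tuned together.
length_scales = np.logspace(-1.5, 0.5, 15)
noise_vars = np.logspace(-4, 0, 15)
best_lml, best_ls, best_nv = max(
    (log_marginal_likelihood(x, y, 1.0, ls, nv), ls, nv)
    for ls in length_scales
    for nv in noise_vars
)
print(best_lml, best_ls, best_nv)
```

A real implementation would use gradient-based optimization rather than a grid, and with only a handful of samples the likelihood surface tends to be flat or multimodal, which matches the remark that these estimates do badly early on.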

As for your new question, I'm not sure I totally understand. A couple of points that may help:

- A GP is a Gaussian at "every point." But that does not mean the GP assumes the underlying system it is modeling is Gaussian or even Gaussian-like. (Deviate too far, e.g. with something that has lots of discontinuities, and performance will suffer, but it will still work.)
- A Gaussian is fully specified by its mean (1st moment) and variance (2nd central moment), with all higher cumulants being 0. I believe it is the only nontrivial distribution with this property. So at locations where mean & variance are known, we tell MOE the appropriate values.
- Computing mean/variance of CTR using "normal" statistics is actually incorrect b/c the CTR distribution is not Gaussian.
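As an illustration of taking the mean and variance from a beta distribution instead of "normal" statistics, here is a small sketch. The uniform Beta(1, 1) prior, the function name, and the example counts are assumptions made for this example, not something stated in the thread.

```python
def beta_posterior_mean_and_variance(clicks, impressions, prior_a=1.0, prior_b=1.0):
    """Mean and variance of the Beta posterior over the CTR.

    Assumes a Beta(prior_a, prior_b) prior; the default Beta(1, 1) is uniform.
    """
    a = prior_a + clicks
    b = prior_b + (impressions - clicks)
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1.0))
    return mean, variance


# Hypothetical example: 0 clicks in 50 impressions.
post_mean, post_var = beta_posterior_mean_and_variance(clicks=0, impressions=50)
print(post_mean, post_var)
```

With zero observed clicks, the binomial plug-in variance p(1 - p)/n would be exactly 0, which claims the point is noise-free. The beta posterior still reports a small positive variance, which is the safer value to hand to the GP.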

Also not sure what you mean by "update sigma_n". If you re-sample an old point and get a new noise value, you can just change it when you pass data to MOE. Technically such a change would require re-tuning the hyperparameters. If you mean what to do when the GP "collapses" and essentially yields 0 variance everywhere (so there's nothing more to optimize/test)... well I don't have a great answer for that. I've really only seen this happen in very low dimensions, but things to try include:

- randomly throw out some points
- shorten or lengthen hyperparameters (e.g., say we have samples at 0.1, 0.2, 0.7, 0.8 with length scale = 0.7 and the GP "collapses." You're probably thinking, how does MOE know nothing interesting happens btwn 0.2 and 0.7? It doesn't, but it assumes it does b/c the length scale is relatively large.)

It didn't come up in any experiments we ran on real data, so I haven't spent much time thinking about it.
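The length-scale example above can be checked numerically. The sketch below is plain NumPy rather than MOE, and it assumes a squared-exponential kernel with unit signal variance and a tiny jitter in place of measured noise. It compares the GP posterior variance at 0.45, the middle of the gap, for a long and a short length scale.

```python
import numpy as np


def rbf(x1, x2, length_scale):
    """Squared-exponential kernel with unit signal variance (assumption)."""
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def posterior_variance(x_train, x_query, length_scale, noise_var=1e-6):
    """GP posterior variance at x_query given noise-free-ish samples at x_train."""
    K = rbf(x_train, x_train, length_scale) + noise_var * np.eye(len(x_train))
    k_star = rbf(x_train, x_query, length_scale)
    # Prior variance is 1 everywhere; subtract what the samples explain.
    return 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)


x_train = np.array([0.1, 0.2, 0.7, 0.8])
x_query = np.array([0.45])  # middle of the unsampled gap

var_long = posterior_variance(x_train, x_query, length_scale=0.7)[0]
var_short = posterior_variance(x_train, x_query, length_scale=0.1)[0]
print(var_long, var_short)
```

With length scale 0.7 the variance in the gap is close to zero, so the model sees nothing left to explore there. With length scale 0.1 the variance stays near the prior value of 1, and the gap looks worth sampling again.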

Yelp / MOE

Estimating the noise level hyperparameter \sigma_n #451