Yelp / MOE

A global, black box optimization engine for real world metric optimization.

Estimating the noise level hyperparameter \sigma_n #451

Open KukumavMozolo opened 8 years ago

KukumavMozolo commented 8 years ago

As far as I know, there is no implemented method for estimating the autocorrelated noise (noise_variance) from points_sampled via maximum_likelihood or leave_one_out. I wonder why that is, and whether you are planning to implement it?

KukumavMozolo commented 8 years ago

OK, I found part of the answer in another thread. As suntzu86 said:

For each data point, you'd want to provide MOE with the mean & variance. For us, we were observing user clicks, so we computed mean/var as in a binomial distribution. I'm not sure what you're measuring, but you'd want either that or mean/var from beta.

P.S. I don't have the reference handy but there are more scientific ways of picking this value (based on max likelihood). But implementing these isn't trivial b/c it introduces a dependency on the hyperparameters and you'd need to optimize everything together. Plus, these estimates seem to do pretty badly early on when you don't have much data. So I haven't done it here.
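For reference, the binomial and beta mean/variance computations mentioned in the quote could look roughly like the sketch below. The function names and the uniform Beta(1, 1) prior are assumptions made for illustration; none of this is MOE code, and the thread does not say exactly how the values were computed.

```python
def binomial_mean_var(clicks, impressions):
    """Sample CTR and the variance of that estimate under a binomial model."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    p = clicks / impressions
    # Variance of the estimated mean (not of a single click): p(1 - p) / n.
    return p, p * (1.0 - p) / impressions


def beta_mean_var(clicks, impressions, prior_a=1.0, prior_b=1.0):
    """Posterior mean and variance of the CTR under a Beta(prior_a, prior_b) prior.

    The uniform prior (1, 1) is an assumption, not something from the thread.
    """
    a = prior_a + clicks
    b = prior_b + impressions - clicks
    total = a + b
    mean = a / total
    var = a * b / (total * total * (total + 1.0))
    return mean, var


# Hypothetical measurement: 30 clicks out of 1000 impressions.
ctr, noise_variance = binomial_mean_var(clicks=30, impressions=1000)
print(ctr, noise_variance)
print(beta_mean_var(clicks=30, impressions=1000))
```

One practical difference: with zero clicks the binomial estimate reports a variance of exactly 0, which would tell the GP that the point is noise-free, whereas the beta posterior still reports a small positive variance.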

KukumavMozolo commented 8 years ago

So if I want to model the CTR, you are suggesting a beta distribution. But we are pretending that the CTR is normally distributed. I could imagine that the right tail of the beta is much longer than that of a normal distribution. Don't we underestimate the variance of the normal in that case? Another question: how would you update sigma_n with new information acquired while optimizing the objective? In other words, how do we keep the GP from explaining all parameter-dependent variation as noise?
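One rough way to gauge how much the normal assumption matters is the skewness of the beta posterior: a normal with matched mean and variance has the same second moment by construction, so what it misses is the asymmetry of the tail. The sketch below uses the standard closed-form skewness of a Beta(a, b) distribution; the uniform prior and the example click counts are assumptions for illustration only.

```python
import math


def beta_skewness(a, b):
    """Skewness of a Beta(a, b) distribution (0 means symmetric)."""
    return 2.0 * (b - a) * math.sqrt(a + b + 1.0) / ((a + b + 2.0) * math.sqrt(a * b))


def ctr_posterior_skewness(clicks, impressions):
    """Skewness of the CTR posterior under an assumed uniform Beta(1, 1) prior."""
    return beta_skewness(1.0 + clicks, 1.0 + impressions - clicks)


# Hypothetical 3% CTR observed with increasing amounts of traffic.
cases = ((3, 100), (30, 1000), (3000, 100000))
for clicks, impressions in cases:
    print(impressions, ctr_posterior_skewness(clicks, impressions))
```

For a low CTR the skewness is positive (a longer right tail) and shrinks as the number of impressions grows, so the normal approximation is weakest for points with little traffic.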

suntzu86 commented 8 years ago

Yeah if there's a lot of demand for automatic noise estimation, I could add that in as an option... don't hold your breath though :) But from our experience, cases we ran across had either measured or estimated noise or were known to be noise-free. It seemed better to use these facts about the black box being optimized rather than select an arbitrary-ish (and universal... same noise at every point) noise value based on likelihood.
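To make the likelihood-based alternative concrete, here is a minimal sketch of picking one universal noise variance by maximizing the GP log marginal likelihood over a grid. It is plain Python, not MOE's API. The squared-exponential kernel, its fixed signal variance and length scale, the synthetic data, and the candidate grid are all assumptions; fixing the kernel hyperparameters deliberately sidesteps the joint optimization that makes the real problem harder.

```python
import math
import random


def sq_exp_kernel(x1, x2, signal_variance=1.0, length_scale=1.0):
    """Squared-exponential covariance with assumed, fixed hyperparameters."""
    return signal_variance * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def log_marginal_likelihood(xs, ys, noise_variance):
    """log p(y | X) for a zero-mean GP with the same noise variance at every point."""
    n = len(xs)
    cov = [[sq_exp_kernel(xs[i], xs[j]) + (noise_variance if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    lower = cholesky(cov)
    # Forward substitution: solve lower * z = ys, so that y^T K^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((ys[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)


def pick_noise_variance(xs, ys, candidates):
    """Return the candidate noise variance with the highest marginal likelihood."""
    return max(candidates, key=lambda nv: log_marginal_likelihood(xs, ys, nv))


# Synthetic data: a smooth function plus Gaussian noise (true variance 0.04).
rng = random.Random(0)
xs = [0.25 * i for i in range(30)]
ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]
candidates = [10.0 ** e for e in (-4, -3, -2, -1, 0)]
best = pick_noise_variance(xs, ys, candidates)
print(best)
```

Note that this yields a single value for all points, which is exactly the limitation described above: any measured, per-point noise information is ignored.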

As for your new question, I'm not sure I totally understand. A couple of points, maybe this helps

Also not sure what you mean by "update sigma_n". If you re-sample an old point and get a new noise value, you can just change it when you pass data to MOE. Technically such a change would require re-tuning the hyperparameters. If you mean what to do when the GP "collapses" and essentially yields 0 variance everywhere (so there's nothing more to optimize/test)... well I don't have a great answer for that. I've really only seen this happen in very low dimensions, but things to try include: