Yelp / MOE

A global, black box optimization engine for real world metric optimization.
Other
1.31k stars 140 forks source link

MOE crashes when two points are identical (with no noise) #17

Closed sc932 closed 10 years ago

sc932 commented 10 years ago

LinAlgError: Singular matrix

We should combine these two points instead of putting them both into the matrix. If they are the exact same point, with the exact same value, with no noise, then they are the same point and should be treated as such (instead of crashing).

sc932 commented 10 years ago

This also happens when they have different values (but still no noise). This case is user error, there is no way for the same point to have two different values with no error (it is no longer a function). Right now we fail with

LinAlgError: Singular matrix

But for this specific case we should give more context that the user put in impossible data.

suntzu86 commented 10 years ago

I think both cases are user error. In the first case, why did it happen? Did the user mean to input a different point but made an error? Duplicating points (and what is dupe: exact match? within epsilon?) smells like a mistake, and I'd rather fail than swallow it and do something potentially surprising.

In the second case, 'impossible' data isn't obvious (but it is probable). The user might be running MOE w/o exposing all parameters to us (e.g., time). So they made precise measurements at 2 different times. They need to expose the hidden parameter or add 'fake' noise.

Lastly, having noise means these scenarios don't result in crashes (and can even be well-behaved numerically if the noise is large enough), but they still feel like questionable use cases.

peter-i-frazier commented 10 years ago

Just to chip in, if there is noise, measuring the same point twice is reasonable user behavior. Combining the two points into one with the appropriate variance seems like the right fix. On Apr 3, 2014 5:26 PM, "suntzu86" notifications@github.com wrote:

I think both cases are user error. In the first case, why did it happen? Did the user mean to input a different point but made an error? Duplicating points (and what is dupe: exact match? within epsilon?) smells like a mistake, and I'd rather fail than swallow it and do something potentially surprising.

In the second case, 'impossible' data isn't obvious (but it is probable). The user might be running MOE w/o exposing all parameters to us (e.g., time). So they made precise measurements at 2 different times. They need to expose the hidden parameter or add 'fake' noise.

Lastly, having noise means these scenarios don't result in crashes (and can even be well-behaved numerically if the noise is large enough), but they still feel like questionable use cases.

Reply to this email directly or view it on GitHubhttps://github.com/sc932/MOE/issues/17#issuecomment-39506999 .

sc932 commented 10 years ago

First case: If we want to treat exact (within 1e-14 or something) matches with no noise as input errors that is fine. We should not treat them as singular matrix linalg errors that crash the system though. Throw an exception that python can catch that explains the problem.

Second case: This is still user error then (they are ignoring either time or noise). We cannot model a non-function with a GP, the base assumption is f ~ GP(mu, sig) and non-functions are not in that space. We should fail with an exception that tells the user what they did wrong. We can give tips on how to fix it in the docs.

Peter: I completely agree, and we have actually done this many times in production (we play "king of the hill" with the best point round over round so the same point can be sampled many times). As long as there is some noise the system handles this fine. We looked into combining them into a single point as well, but wanted the ability to disambiguate them later if needed (to correct a measurement etc). The system works well under this use case though (and we have run into it several times).

Action items: Throw InputValue exceptions when the user gives "bad" data. Fail fast, fail with an explanation and fail in a way that doesn't crash the server.

suntzu86 commented 10 years ago

These errors are actually coming out of the python side (which means they should already be catchable exceptions from numpy), before we hit C++. However the errors would (or at least should) happen in C++ as well. I'm in the process of redoing the interface so we'll revisit this once we have a better idea of what the "final" workflow will look like.

First/Second case: Yeah I agree they're both user errors. Just pointing out that the 2nd case is potentially a much more subtle error than the first 1st.

Peter: As Scott said, so far we've just been inserting all points separately; and since there's noise, nothing bad happens. In our case, our dupe points were measured over different time periods. So it's unclear that combining them is the right path. Even if we were to combine them, it's also not obvious what the "appropriate amount" of noise is since you don't have access to the original data anymore. Or am I getting confused on stats here?

Expanding on the reasonable path forward: 1) C++ to throw exceptions when it encounters singular matrices. I have this ticketed at yelp (ADS-3391). It's slightly subtle b/c a one multistart failing out of 100 is no big deal, but gaussian process creation failing is a big deal. 2) python optional "data sanity check" that can check for dupe points, no input data, etc. and warn the user appropriately. this can be done in C++ too but it's easier to make sure it only happens once in python & the code will be much easier to write/manipulate in python.

sc932 commented 10 years ago

Made issue #19 to deal with 2) above.

This issue still has the action items of throwing InputErrors for the two original cases talked about.

peter-i-frazier commented 10 years ago

Hi suntzu, yes, there's a standard trick that let's you combine two measurements of the same point, when errors are normal.

If the two measurement values are y(1) and y(2), and they have strictly positive variances 1/b(1) and 1/b(2), then under the assumption of independent normally distributed noise, this will give you the same result in Gaussian process regression as observing a single point with value:

y = (b(1)_y(1) + b(2)_y(2)) / (b(1)+b(2))

and variance:

1/(b(1) + b(2)).

In a perfect world, with infinite precision floating point, this gives you the same result as keeping the two points separate. My guess is that with finite precision floating point, it either has no effect, or helps you slightly, depending on the situation, though I don't really know.

If you wanted to use this trick in the code, it would reduce the dimension of the matrix you need to deal with in the GP regression, though of course the amount of the advantage depends on what fraction of points are duplicated.

I'll leave it up to you guys if you want to use this trick in the code or not.

Now, it should be said that this is only useful once you've already estimated the value of the sampling variances. When estimating the sampling variances, points should be kept separate, or at least other sufficient statistics need to be computed.

It should also be said that if we go to a more sophisticated statistical model with non-normal samples, then this trick no longer works.

-Peter

On Apr 3, 2014 5:26 PM, "suntzu86" notifications@github.com wrote:

I think both cases are user error. In the first case, why did it happen? Did the user mean to input a different point but made an error? Duplicating points (and what is dupe: exact match? within epsilon?) smells like a mistake, and I'd rather fail than swallow it and do something potentially surprising.

In the second case, 'impossible' data isn't obvious (but it is probable). The user might be running MOE w/o exposing all parameters to us (e.g., time). So they made precise measurements at 2 different times. They need to expose the hidden parameter or add 'fake' noise.

Lastly, having noise means these scenarios don't result in crashes (and can even be well-behaved numerically if the noise is large enough), but they still feel like questionable use cases.

Reply to this email directly or view it on GitHubhttps://github.com/sc932/MOE/issues/17#issuecomment-39506999 .

suntzu86 commented 10 years ago

Following up on the original discussion in this ticket, it is no longer possible to for the 'main' optimal_learning codepaths to throw LinAlgError. That's a numpy error; it arose previously from the Python implementation which is now totally separate from the C++ wrappers.

So we should delete special handling of that exception (and any other python-specific exceptions--how to find these?).

sc932 commented 10 years ago

So can't the python version still throw it even if the C++ won't?

To find all exception handling search python files for try/catch or raise.

suntzu86 commented 10 years ago

Yeah the python version can throw that but there also isn't much reason to hook the entire REST interface up to the python version.

Beyond that, my changes in #205 will obviate this issue.

suntzu86 commented 10 years ago

Closed by pull #212