SheffieldML / GPyOpt

Gaussian Process Optimization using GPy
BSD 3-Clause "New" or "Revised" License

Why break upon bo._update_model() LinAlgError? #299

Open ekalosak opened 4 years ago

ekalosak commented 4 years ago

On https://github.com/SheffieldML/GPyOpt/blob/master/GPyOpt/core/bo.py#L138, the response to a LinAlgError in bo._update_model() is to break and end the optimization. This seems a little heavy-handed: ending the whole optimization early because of a hyperparameter-optimization error doesn't sit right. What is the rationale behind this choice? Would retrying a couple of times be a sensible alternative? Thanks for any/all input.

apaleyes commented 4 years ago

I don't know the original thinking behind this line. But I traced the call stack back to the paramz package, where the error is only raised after a number of restarts (10 by default, and I don't believe that default is changed anywhere). So it appears that if you got this error at the GPyOpt level, a simple optimization restart won't help.

That's just a theory though. Do you have code that runs into this exception? It would be nice to see whether adding some simple retry logic would help.

ekalosak commented 4 years ago

Preliminary results confirm your hunch - I added a little retry loop

from numpy.linalg import LinAlgError

nfail = 0
done = False
while not done:
  try:
    bo._update_model()
    done = True
  except LinAlgError:
    nfail += 1
    if nfail > 5:
      raise

and it did nothing to stop the error. It was admittedly a fast hack - the real solution seems to require some digging. My new approach is to prune the X matrix upon getting an error - the traceback points to set_XY in the GPy.gp.GP object. This appears to be the trigger for the paramz updatable object's update_model - such a long execution stack!

It's a sporadic error - bo.run_optimization() is sufficient to generate it on a discrete/categorical response surface, but you have to get unlucky to reproduce.

ekalosak commented 4 years ago

To reproduce in any discrete space, simply let the optimization run long enough and bo.X ends up with sufficiently many duplicate rows (e.g. bo.X.shape[0] == 110 while np.unique(bo.X, axis=0).shape[0] == 30) to make even the jittered Cholesky decomposition in GPy fail (https://github.com/SheffieldML/GPy/blob/devel/GPy/util/linalg.py#L75). NB: you need to make sure de-duplication is off.
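The mechanism is easy to see with plain numpy (this is an illustration of the linear algebra, not GPyOpt code): stacking repeated input rows makes the kernel matrix rank-deficient, which is exactly the situation where a Cholesky factorization needs jitter and can still fail. The kernel function below is a generic squared-exponential written from scratch for the demo.

```python
# Standalone numpy illustration (not GPyOpt code): duplicate rows in the
# input matrix X make the kernel matrix k(X, X) rank-deficient.
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    # Squared-exponential kernel matrix k(X, X).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(0)
unique_pts = rng.standard_normal((5, 2))
X = np.vstack([unique_pts] * 6)  # 30 rows, only 5 unique (like bo.X above)

n_total = X.shape[0]                      # 30
n_unique = np.unique(X, axis=0).shape[0]  # 5

K = rbf_kernel(X)
# 25 of K's 30 eigenvalues are (numerically) zero, so K is singular and a
# plain Cholesky factorization is not guaranteed to succeed without jitter.
print(n_total, n_unique, np.linalg.matrix_rank(K))
```

The rank of K equals the number of *unique* rows of X, not the number of evaluations, which is why pruning duplicates (or forcing exploration away from them) restores a well-conditioned matrix.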

ekalosak commented 4 years ago

I've found a work-around, but I'm not sure if it's general enough to be a useful PR:

from numpy.linalg import LinAlgError

while True:  # outer optimization loop
  try:
    bo._update_model()
  except LinAlgError:
    bo.X, bo.Y = remove_dense_points_with_little_variation(bo.X, bo.Y)
    bo.X = explore_more_points_far_from_already_evaluated(bo.X)
    continue
  # ... suggest the next point, evaluate it, append to bo.X / bo.Y ...

This is particularly useful in spaces with no continuous dimensions because the next suggestion is snapped to the nearest lattice point, and depending on your acquisition function (EI, MPI are the worst offenders afaik), new suggestions are concentrated around 'very good' previous evaluations.
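For concreteness, here is a numpy-only guess at what the two helpers in the loop above might do; these are hypothetical implementations, not the actual private code, and the exploration helper takes an explicit `candidates` argument so the sketch is self-contained.

```python
# Hypothetical sketches of the two helpers from the retry loop above.
# These are guesses at the behaviour, not the author's actual code.
import numpy as np

def remove_dense_points_with_little_variation(X, Y):
    # Keep one representative per duplicated row of X, paired with the
    # best (lowest) Y observed at that location.
    uniq, inverse = np.unique(X, axis=0, return_inverse=True)
    best_y = np.full((uniq.shape[0], 1), np.inf)
    for row, y in zip(inverse, np.asarray(Y).ravel()):
        best_y[row, 0] = min(best_y[row, 0], float(y))
    return uniq, best_y

def explore_more_points_far_from_already_evaluated(X, candidates, n_new=3):
    # Append the candidate lattice points whose minimum distance to the
    # already-evaluated rows of X is largest (pure exploration).
    dists = np.linalg.norm(candidates[:, None, :] - X[None, :, :], axis=-1)
    farthest = np.argsort(dists.min(axis=1))[::-1][:n_new]
    return np.vstack([X, candidates[farthest]])

# Tiny demo on a 1D lattice.
X = np.array([[0.0], [0.0], [1.0]])
Y = np.array([[2.0], [1.0], [3.0]])
X_pruned, Y_pruned = remove_dense_points_with_little_variation(X, Y)
lattice = np.array([[0.0], [1.0], [2.0], [3.0]])
X_explored = explore_more_points_far_from_already_evaluated(
    X_pruned, lattice, n_new=1)
print(X_pruned.ravel(), Y_pruned.ravel(), X_explored.ravel())
```

In the demo the duplicated point at 0.0 collapses to a single row carrying its best observed Y, and the exploration step appends the lattice point farthest from everything already evaluated.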

Then again, it might be better to leave it as it is: break and report, as it currently does, "optimization complete". What do you think @apaleyes about introducing an explore_more_when_stuck: bool optional kwarg?

apaleyes commented 4 years ago

First of all, let me see if I understand this right. Are we essentially heading towards "optimization is stuck at a local extremum, let's jump elsewhere at random and start anew"? Or is it more subtle?

ekalosak commented 4 years ago

It is not more subtle, you hit the nail on the head. Admittedly, this is a blunt instrument. I'd love to move towards a more principled or correct solution, so any advice on that end would be greatly appreciated.

apaleyes commented 4 years ago

That's not necessarily a bad idea, at least as a first try. The API should probably say how many times we are allowed to jump. But it should be extensively tested, as I'd rather leave it as is than introduce some new parameters that just confuse the whole thing. I'm also not sure the implementation is that simple, e.g. what should be done with the acquisition function then?

@javiergonzalezh any thoughts?

ekalosak commented 4 years ago

Having implemented this solution privately and having used it in a few tens of thousands of optimization runs, I'm pretty happy with it - and the implementation is just that simple. Whether it's right is a totally separate question, of course.

Adding an API parameter with a default num_times_to_jump_when_stuck value isn't a huge lift.
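A hedged sketch of the control flow such a parameter might gate. The function name, the `jump_fn` callback, and the `FakeBO` stand-in are all hypothetical, invented here so the snippet runs without GPyOpt; the real version would call the pruning/exploration helpers instead of `jump_fn`.

```python
# Sketch of retry-with-jumps logic behind a num_times_to_jump_when_stuck
# parameter. All names here are hypothetical, not GPyOpt API.
import numpy as np

def update_model_with_jumps(bo, jump_fn, num_times_to_jump_when_stuck=3):
    # Retry bo._update_model(), applying jump_fn after each LinAlgError,
    # up to the configured number of jumps; then re-raise.
    for attempt in range(num_times_to_jump_when_stuck + 1):
        try:
            bo._update_model()
            return attempt
        except np.linalg.LinAlgError:
            if attempt == num_times_to_jump_when_stuck:
                raise
            jump_fn(bo)

# Stand-in for a GPyOpt BO object: fails a fixed number of times, then
# succeeds, so we can exercise the retry path without GPyOpt installed.
class FakeBO:
    def __init__(self, failures):
        self.failures = failures
        self.jumps = 0

    def _update_model(self):
        if self.failures > 0:
            self.failures -= 1
            raise np.linalg.LinAlgError("not positive definite")

bo = FakeBO(failures=2)
attempts = update_model_with_jumps(
    bo, jump_fn=lambda b: setattr(b, "jumps", b.jumps + 1))
print(attempts, bo.jumps)  # succeeds on attempt 2 after 2 jumps
```

Raising after the final failed attempt preserves the current behaviour as the fallback, so existing users who hit a genuinely unrecoverable LinAlgError still see it.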

I'm not sure I understand what needs to be done with the acquisition function in this case - there's no surrogate model fitting nor acquisition sub-optimization required if explore_more_points_far_from_already_evaluated is only contingent on the X data and the design space. Maybe I'm missing a subtlety with the parameter update triggers or something like that?