bayesian-optimization / BayesianOptimization

A Python implementation of global optimization with Gaussian processes.
https://bayesian-optimization.github.io/BayesianOptimization/index.html
MIT License

simple question about selecting next x values in acquisition function #522

Closed Yoonho-Na closed 1 month ago

Yoonho-Na commented 1 month ago

Hello, I have a question about setting the number of samples at which to evaluate mu and std when calculating the acquisition function. For example, when we calculate the acquisition function with Probability of Improvement, we use this kind of formula:

cdf((mu-current_best)/std)

But I'm curious how densely we need to place the x samples in the search space when calculating mu and std. Are there any guidelines for this, or another method?

till-m commented 1 month ago

Hey @Yoonho-Na,

I think you might be a bit confused as to how the package works. mu and std are not statistical variables describing the data; they're properties of the Gaussian process at a specific point. You should not need to set them manually. What are you trying to do?
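To illustrate (a minimal sketch using scikit-learn's GP rather than this package's internals; the data here is made up):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

X_train = np.array([[0.1], [0.4], [0.9]])       # observed inputs
y_train = np.sin(3 * X_train).ravel()           # observed targets

gp = GaussianProcessRegressor().fit(X_train, y_train)

x_query = np.array([[0.5]])                     # one specific point
mu, std = gp.predict(x_query, return_std=True)  # pointwise predictive mean and std

The mean and standard deviation come out of the fitted surrogate wherever you query it; they are not something you set.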

Yoonho-Na commented 1 month ago

@till-m I'm trying to implement Bayesian optimization with a neural process as the surrogate. I haven't implemented BO from scratch before, so I think I'm a little confused.

At first I tried to modify the code from your source, but it was a little difficult, so I followed the scheme from this link instead.

# probability of improvement acquisition function
# assumes surrogate(model, X) (defined elsewhere in the linked tutorial)
# returns (mean, std) predictions from the model
from numpy import argmax
from numpy.random import random
from scipy.stats import norm

def acquisition(X, Xsamples, model):
    # calculate the best surrogate score found so far
    yhat, _ = surrogate(model, X)
    best = max(yhat)
    # calculate mean and stdev via the surrogate at the candidate points
    mu, std = surrogate(model, Xsamples)
    mu = mu[:, 0]
    # calculate the probability of improvement
    probs = norm.cdf((mu - best) / (std + 1E-9))
    return probs

# optimize the acquisition function
def opt_acquisition(X, y, model):
    # random search: generate random candidate samples in [0, 1)
    Xsamples = random(100)
    Xsamples = Xsamples.reshape(len(Xsamples), 1)
    # calculate the acquisition function for each candidate
    scores = acquisition(X, Xsamples, model)
    # locate the index of the largest score
    ix = argmax(scores)
    return Xsamples[ix, 0]

I also tried to find this kind of line in your code, but I couldn't, so I assumed you are using some other method to get the X sample data.

but they're properties of the Gaussian process at a specific point.

What do you mean by "a specific point"? I think this could be the clue. I'm currently using points like this: np.linspace(pbound[0], pbound[1], 100)

Thank you for the reply.

till-m commented 1 month ago

Hey @Yoonho-Na,

the way BO works is somewhat like this:

  1. You get some initial data
  2. You fit the GP (in your case the NP) to the data
  3. You construct an acquisition function that uses the GP model to evaluate which regions are particularly interesting to probe
  4. You find the maximum of your acquisition function. This step is often glossed over in BO. In this package, we use a combination of random sampling and quasi-Newton optimization to find the maximum (see the sketch after this list).
  5. You evaluate your target function at the arg max of the acquisition function found in step 4.
  6. Repeat steps 2-5 until a good enough point is found
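
In pseudocode-ish Python, that loop looks roughly like this (a sketch using scikit-learn's GP and plain random search for step 4, not this package's actual implementation; `objective` is a placeholder target):

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):                                         # placeholder target function
    return np.sin(x)

bounds = (0.0, np.pi)
X = np.random.uniform(*bounds, size=(5, 1))               # 1. initial data
y = objective(X).ravel()

for _ in range(20):
    gp = GaussianProcessRegressor().fit(X, y)             # 2. fit the surrogate
    cand = np.random.uniform(*bounds, size=(1000, 1))     # candidate points
    mu, std = gp.predict(cand, return_std=True)
    pi = norm.cdf((mu - y.max()) / (std + 1e-9))          # 3. acquisition (PI)
    x_next = cand[np.argmax(pi)].reshape(1, 1)            # 4. arg max of the acquisition
    y_next = objective(x_next).ravel()                    # 5. evaluate the target there
    X, y = np.vstack([X, x_next]), np.concatenate([y, y_next])  # 6. repeat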

Can you pinpoint what you're struggling with exactly?

Yoonho-Na commented 1 month ago

@till-m Don't we have to get the target distribution for each input parameter x, covering the full pbound region, when probing?

For example, suppose the task is finding the arg max of sin(x), where the pbound is 0 <= x <= pi. Since x is continuous and we cannot get the target distribution for every continuous x, I discretized the possible inputs into 101 bins, like this: [0, pi*1/100, pi*2/100, ..., pi]. Using the surrogate model, we can obtain the target distribution for each bin, and so we can calculate the acquisition function for each bin using cdf((mu - best) / (std + 1E-9)).
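
(A sketch of what I mean, reusing the surrogate helper, model, and X from the code above:)

import numpy as np
from scipy.stats import norm

bins = np.linspace(0, np.pi, 101).reshape(-1, 1)    # 101 discretized x values
mu, std = surrogate(model, bins)                    # target distribution per bin
best = max(surrogate(model, X)[0])                  # best surrogate value so far
probs = norm.cdf((mu[:, 0] - best) / (std + 1E-9))  # PI for each bin
x_next = bins[np.argmax(probs), 0]                  # next x, restricted to the bins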

Is there anything that I'm misunderstanding?

I think my method is wrong because the suggestion for the next x is restricted to the bins I previously defined; if there are many parameters, the number of bins gets very large because there are so many combinations; and the optimization result will differ greatly depending on the number of bins.

till-m commented 1 month ago

Hey @Yoonho-Na,

Don't we have to get the target distribution for each input parameter x, covering the full pbound region, when probing? Is there anything that I'm misunderstanding?

No, you don't have to evaluate every single parameter value; as you noticed, that's not really feasible. You can do what you're doing and just evaluate on a grid. It might be more prudent to use a random sampling method (uniform, Latin hypercube, or similar) so you're not restricted to the grid points, but either way is valid.
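
For example, with SciPy's quasi-Monte Carlo module (a sketch; the bounds here are made up):

from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=2, seed=0)
unit = sampler.random(n=100)                           # 100 points in the unit square
Xsamples = qmc.scale(unit, [0.0, -5.0], [3.14, 5.0])   # rescale to your pbounds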

What this package does is a combination of random sampling and a quasi-Newton method. See the code here (random sampling) and here (quasi-Newton). The quasi-Newton step is much more expensive, but it will usually lead to better results.
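
Schematically, the acquisition maximization looks something like this (a rough sketch, not the actual code behind those links; `acq` and `bounds` are placeholders):

import numpy as np
from scipy.optimize import minimize

def maximize_acquisition(acq, bounds, n_random=10000, n_restarts=10, seed=0):
    # acq: callable mapping an (n, d) array to (n,) acquisition values
    # bounds: array of shape (d, 2) with [low, high] per dimension
    rng = np.random.default_rng(seed)
    d = bounds.shape[0]

    # 1) cheap random sampling over the whole box
    cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_random, d))
    scores = acq(cand)
    x_best, s_best = cand[np.argmax(scores)], scores.max()

    # 2) quasi-Newton (L-BFGS-B) polish from a few of the best random points
    starts = cand[np.argsort(scores)[-n_restarts:]]
    for x0 in starts:
        res = minimize(lambda x: -acq(x.reshape(1, -1))[0], x0,
                       bounds=bounds, method="L-BFGS-B")
        if -res.fun > s_best:
            x_best, s_best = res.x, -res.fun
    return x_best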

I hope that helps.

Yoonho-Na commented 1 month ago

@till-m Oh, now I understand it clearly. Could you answer just one more simple question? I'm not an expert in BO, but my instinct tells me that evenly distributed evaluation points are better for probing correctly. Of course, a sufficient number of sampling points may be fine, but I just want to make sure. So I'm thinking of trying the grid-based approach from before, but adding a small amount of Gaussian noise to the evaluation grid points. Do you think this is valid, or is it worthless?

till-m commented 1 month ago

"Evenly distributed" can mean many things. If you're talking about sampling the grid points, the problem is that you're restricting the solution space to those points. This becomes more relevant the more steps you have: if you sample multiple times, you will always see the same points. With random sampling this is not the case. If you're worried about covering the space unevenly, you can look into Latin hypercube sampling or orthogonal sampling.

Yoonho-Na commented 1 month ago

"Evenly distributed" can mean many things. If you're talking about sampling the grid points, the problem is that you're restricting the solution space to those points.

This is why I suggested adding Gaussian noise to the grid points, so the grid stays almost evenly distributed but doesn't have exactly the same points every iteration.

So I'm thinking of trying the grid-based approach from before, but adding a small amount of Gaussian noise to the evaluation grid points.
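
(Roughly what I had in mind:)

import numpy as np

grid = np.linspace(0, np.pi, 101)
jittered = grid + np.random.normal(0.0, 0.01, size=grid.shape)  # small Gaussian jitter
jittered = np.clip(jittered, 0, np.pi)                          # keep points inside the pbounds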

But I guess I'll just try random sampling and quasi-Newton first, as you advised.