bayesian-optimization / BayesianOptimization

A Python implementation of global optimization with gaussian processes.
https://bayesian-optimization.github.io/BayesianOptimization/index.html
MIT License

Proper parallel optimisation #347

Closed: Rizhiy closed this issue 4 months ago

Rizhiy commented 2 years ago

I have a single-threaded function which I want to optimise. I'm trying to write a wrapper that would handle multiple runs at the same time, but I'm noticing considerable degradation in results as I increase the number of parallel evaluations.

Here is the rough logic of what I'm doing:

from __future__ import annotations

from pprint import pprint
from typing import Callable

import numpy as np
from bayes_opt.bayesian_optimization import BayesianOptimization
from bayes_opt.util import UtilityFunction
from tqdm import tqdm, trange

def multivariable_func(r: float, x: float, y: float, diff: float) -> float:
    # treat r as an integer and diff as a boolean flag, since the optimizer only works with floats
    r = int(r)
    diff = diff > 0.5

    loss = (r - 5) ** 2
    loss += (x**2 + y**2 - r) ** 2
    loss += abs(x - y) * (-1) ** int(diff)
    loss += 0.5 * x
    loss += -0.25 * int(diff)
    return -loss

def optimize(func: Callable[..., float], num_iter: int, bounds: dict[str, tuple[float, float]], num_workers=0):
    init_samples = int(np.sqrt(num_iter))

    optimizer = BayesianOptimization(f=None, pbounds=bounds, verbose=0)
    # decay kappa from init_kappa down to 0.1 over the iterations after the initial samples
    init_kappa = 10
    kappa_decay = (0.1 / init_kappa) ** (1 / (num_iter - init_samples))
    utility = UtilityFunction(
        kind="ucb", kappa=init_kappa, xi=0.0, kappa_decay=kappa_decay, kappa_decay_delay=init_samples
    )

    # with no data registered yet, these initial suggestions are effectively random points
    init_queue = [optimizer.suggest(utility) for _ in range(init_samples)]
    result_queue = []
    tbar = tqdm(total=num_iter, leave=False)
    while len(optimizer.res) < num_iter:
        sample = init_queue.pop(0) if init_queue else optimizer.suggest(utility)
        loss = func(**sample)
        # delay registration by num_workers results to simulate parallel evaluation
        result_queue.append((sample, loss))
        if len(result_queue) >= num_workers:
            try:
                optimizer.register(*result_queue.pop(0))
                utility.update_params()
                tbar.update()
            except KeyError:
                pass
    return optimizer.max

bounds = {"r": [-10, 10], "x": [-10, 10], "y": [-10, 10], "diff": [0, 1]}

all_results = {}
for num_workers in tqdm([1, 2, 4, 8], desc="Checking num_workers"):
    results = []
    for idx in trange(2, desc=f"Sampling with {num_workers=}"):
        best = optimize(multivariable_func, 400, bounds, num_workers)
        results.append(best["target"])
    all_results[num_workers] = np.mean(results)
    tqdm.write(f"Result for optimizing with {num_workers=}: {all_results[num_workers]}")
print("\n")
pprint(all_results)

The result_queue variable simulates evaluation across multiple processes.

Here are the results:

{1: 4.320798413579277,
 2: 4.320676522735756,
 4: 3.5379530743926133,
 8: 2.175667857740832}

As can be seen, the more processes I use, the worse the final result is. I don't understand why that would happen; even if a few of the suggestions are evaluated with slightly stale information, the results should not differ this much.

What am I missing?

bwheelz36 commented 2 years ago

Hi

Just to make sure I understand: Essentially the goal is to collect num_workers results, then update the GP model with all collected results? And in this example, nothing is actually happening in parallel - but the point is it could be - is that correct?

It appears to me that the issue arises when you register the results: optimizer.register(*result_queue.pop(0)). This line only registers one of the num_workers results, so essentially for four workers you are throwing away 75% of the information... You can check the registered data with optimizer.res.
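
For example (just a quick sanity check, assuming the optimizer instance from your snippet):

# count and inspect the observations the optimizer has actually registered
print(f"registered points: {len(optimizer.res)}")
print(optimizer.res[-1])  # most recent observation, a dict with 'target' and 'params'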

Also, you seem to be using the kappa decay functionality to generate an init_queue of init_samples length - but since at this point there is no data, I guess this is just generating random points? Which is fine, as long as you are aware of this! (Actually, I quite like this approach of leveraging the kappa decay to generate multiple suggestions - relevant for #327 and #340.)

Apart from this - have you searched these issues for 'parallel'? I don't have much (well any) experience running this library in parallel but it appears that there are some other examples out there...

Rizhiy commented 2 years ago

That is not exactly what is happening. Rather, I have a pool of workers, and when one of them finishes, I register its result and generate a new suggestion for it to evaluate. The number of suggestions is the same, but the registration is delayed by num_workers.

The loop depends on the number of registered results, so it always checks the same number of suggestions.

As explained in other issues, generating multiple suggestions at the same time doesn't work, since the algorithm is deterministic. That is why a new observation is always registered before asking for a new suggestion.
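
For concreteness, here is a rough sketch of the pattern I'm describing with an actual process pool (not my real code; evaluate is a placeholder for the single-threaded objective and needs to be a picklable top-level function):

from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait

from bayes_opt.bayesian_optimization import BayesianOptimization
from bayes_opt.util import UtilityFunction

def parallel_optimize(evaluate, bounds, num_iter=100, num_workers=4):
    optimizer = BayesianOptimization(f=None, pbounds=bounds, verbose=0)
    utility = UtilityFunction(kind="ucb", kappa=5, xi=0.0)

    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        # keep one suggestion in flight per worker, mapping futures to their params
        in_flight = {}
        for _ in range(num_workers):
            params = optimizer.suggest(utility)
            in_flight[pool.submit(evaluate, **params)] = params

        while len(optimizer.res) < num_iter:
            done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                params = in_flight.pop(future)
                # register the finished result, then immediately refill the freed worker
                optimizer.register(params=params, target=future.result())
                new_params = optimizer.suggest(utility)
                in_flight[pool.submit(evaluate, **new_params)] = new_params
        # any evaluations still running are simply discarded when the pool shuts down
    return optimizer.max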

bwheelz36 commented 2 years ago

"As explained in other issues, generating multiple suggestions at the same time doesn't work, since the algorithm is deterministic."

This is correct if kappa doesn't change, but you can certainly obtain n suggestions with n different kappas.

"The number of suggestions is the same, but the registration is delayed by num_workers."

Apologies, I misinterpreted your code. I've had another look at this and I think the result you are seeing may actually be variation as a result of the random state not being set. Can you try again with optimizer = BayesianOptimization(f=None, pbounds=bounds, verbose=0, random_state=1)?

I created a simplified version of your example, and I get consistent results when random_state is set:


from bayes_opt.bayesian_optimization import BayesianOptimization
from bayes_opt.util import UtilityFunction
from scipy.optimize import rosen
from sklearn.gaussian_process.kernels import Matern

def multi_worker_optimize(num_iter: int, bounds: dict[str, tuple[float, float]], num_workers=0):
    # setting the random state should ensure consistent results
    optimizer = BayesianOptimization(f=None, pbounds=bounds, verbose=0, random_state=1)
    utility = UtilityFunction(kind="ucb", kappa=5, xi=0.0)
    # this line should improve the GP fitting:
    optimizer.set_gp_params(kernel=Matern(length_scale=[0.5, 0.5]), n_restarts_optimizer=20)

    result_queue = []
    while len(optimizer.res) < num_iter:
        sample = optimizer.suggest(utility)
        loss = -1*rosen(list(sample.values()))
        result_queue.append((sample, loss))
        if len(result_queue) >= num_workers:
            try:
                optimizer.register(*result_queue.pop(0))
                utility.update_params()
        except KeyError:
            pass
    return optimizer

bounds = {'x': (-5, 5), 'y': (-5, 5)}
n_iter = 100
# test with multi worker approach
for n_workers in [1, 2, 3, 4]:
    multi_opt = multi_worker_optimize(n_iter, bounds, 1)
    print(f'Pooled worker approach with {n_workers} workers: best value of {multi_opt.max["target"]}')

Rizhiy commented 2 years ago

"I've had another look at this and I think the result you are seeing may actually be variation as a result of the random state not being set. Can you try again with optimizer = BayesianOptimization(f=None, pbounds=bounds, verbose=0, random_state=1)?"

I tried that; it doesn't make a difference. Moreover, I am not talking about consistent results, I'm talking about the optimisation successfully finding values close to the global optimum. That should be the case no matter what the initial seed is. I'm also running it multiple times to remove the possibility of lucky guesses.

"I created a simplified version of your example, and I get consistent results when random_state is set:"

If you run the code you provided, then it is expected that you receive the same result, since you fixed the seed and n_workers is not being used. Also, it is not very close to the global optimum with 100 iterations: I only get -0.15534, which is not satisfactory for rosen, it just found the valley. I also ran a fixed version of your code, with n_workers actually being used, and the results are as expected:

Pooled worker approach with 1 workers: best value of -0.15533752893668623
Pooled worker approach with 2 workers: best value of -0.9938185965332633
Pooled worker approach with 3 workers: best value of -23.314667307827033
Pooled worker approach with 4 workers: best value of -0.5002019396699882
bwheelz36 commented 2 years ago

"n_workers is not being used."

oops - good point - I was playing around with that and forgot to change it back.

I've had another look at this, again using the rosen function with the n_workers parameter fixed.

The plot below shows the true rosen function, the rosen function predicted by the gaussian process model after being trained for 200 iterations, the GP standard deviation, and the absolute difference between the true and predicted functions. I'll post the code for this below; note that in the valley region both the uncertainty (SD) and the absolute difference are low. You can do similar things for your multivariable function, but obviously the visualization is not as straightforward. The two points shown are the true maximum and the point found by the optimizer.

[Figure: true rosen function, GP-predicted rosen, GP standard deviation, and absolute difference; the true maximum and the optimizer's best point are marked]

"I only get -0.15534 which is not satisfactory for rosen, it just found the valley."

🤷 This is with kappa=10, so the algorithm will tend to prioritize exploring over finding the global minimum. If you put the kappa decay back in, I get 0.0096. In general, this algorithm is good at exploring and not getting stuck, but not necessarily as good at honing in on local minima.

from scipy.optimize import rosen
from matplotlib import pyplot as plt
import numpy as np

def plot_rosen(optimizer, bounds):
    """
    compare the optimizer gaussian process predictor with the true function 

    :param optimizer: an instance of  bayes_opt.bayesian_optimization.BayesianOptimization post training
    :param bounds: the bounds dict passed to BayesianOptimization
    """

    extent = (bounds['x'][0], bounds['x'][1], bounds['y'][0], bounds['y'][1])
    x = np.linspace(bounds['x'][0], bounds['x'][1], 100)
    y = np.linspace(bounds['y'][0], bounds['y'][1], 100)
    [X, Y] = np.meshgrid(x, y)
    x_sample = np.array([X.flatten(), Y.flatten()])
    real_rosen = -1*rosen(x_sample)
    real_rosen = np.reshape(real_rosen, X.shape)

    # now get the _gp estimate
    predicted_rosen, predicted_rosen_std = optimizer._gp.predict(x_sample.T, return_std=True)
    predicted_rosen = np.reshape(predicted_rosen, X.shape)
    predicted_rosen_std = np.reshape(predicted_rosen_std, X.shape)

    fig, axs = plt.subplots(nrows=2, ncols=2, figsize=[12, 5])
    im1 = axs[0, 0].imshow(real_rosen, vmin=-100, vmax=0, extent=extent, origin='lower')
    axs[0, 0].scatter(1, 1)  # true maximum of -rosen is at (1, 1)
    axs[0, 0].scatter(optimizer.max['params']['x'], optimizer.max['params']['y'])
    axs[0, 0].set_title('real rosen')
    axs[0, 0].set_xlabel('x')
    axs[0, 0].set_ylabel('y')
    plt.colorbar(im1, ax=axs[0, 0])

    im2 = axs[0, 1].imshow(predicted_rosen, vmin=-100, vmax=0,  extent=extent, origin='lower')
    axs[0, 1].scatter(1, 1)
    axs[0, 1].scatter(optimizer.max['params']['x'], optimizer.max['params']['y'])
    axs[0, 1].set_title('GP predicted rosen')
    axs[0, 1].set_xlabel('x')
    axs[0, 1].set_ylabel('y')
    plt.colorbar(im2, ax=axs[0, 1])

    im2 = axs[1, 0].imshow(predicted_rosen_std, vmin=0, vmax=500, extent=extent, origin='lower')
    axs[1, 0].scatter(1, 1)
    axs[1, 0].scatter(optimizer.max['params']['x'], optimizer.max['params']['y'])
    axs[1, 0].set_title('GP standard deviation')
    axs[1, 0].set_xlabel('x')
    axs[1, 0].set_ylabel('y')
    plt.colorbar(im2, ax=axs[1, 0])

    im3 = axs[1, 1].imshow(abs(real_rosen - predicted_rosen), vmin=0, vmax=100, extent=extent, origin='lower')
    axs[1, 1].set_title('abs difference')
    axs[1, 1].set_xlabel('x')
    axs[1, 1].set_ylabel('y')
    axs[1, 1].scatter(1, 1)
    axs[1, 1].scatter(optimizer.max['params']['x'], optimizer.max['params']['y'])
    plt.colorbar(im3, ax=axs[1, 1])
    plt.tight_layout()
    plt.show()

Rizhiy commented 2 years ago

OK, thank you. I've done some tests and here are the results for my example:

num_iter \ num_workers       1       2       4       8       16
400                      4.220   4.233   3.893   2.705   -1.569
500                      4.320   4.316   4.106   3.357    2.101
600                      4.315   4.320   4.257   4.131    3.511
700                      4.321   4.320   4.318   4.304    3.880
800                      4.321   4.321   4.320   4.309    4.145
900                      4.320   4.320   4.320   4.310    4.182

These are averaged over 10 runs with different seeds to remove the effect of lucky guesses.

In this toy example, I would say anything above 4.3 is satisfactory. 8 workers require 700 iterations instead of 500, but they evaluate 8 points at a time, so the wall-clock cost drops from roughly 500 sequential evaluations to about 700 / 8 ≈ 88, which equates to roughly a 470% speed-up. Currently, this is fine for me, so I will stop analysing this for now.

bwheelz36 commented 2 years ago

One other thought - I'm not sure if this will help or not, but possibly a 'flatter' objective function (perhaps by taking the log or even just capping the minimum value) may perform better - the very large gradients involved with these high range objective functions might 'confuse' the algorithm. I'm not sure if that's the case or not but it's something you could check.
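
For example, something along these lines (just a sketch; raw_objective stands in for your existing function and the cap value is arbitrary):

import numpy as np

def flattened_objective(**params):
    raw = raw_objective(**params)  # placeholder for the original (maximised) objective
    # option 1: cap very bad values so they don't dominate the GP fit
    capped = max(raw, -100.0)
    # option 2: compress the dynamic range with a signed log transform
    compressed = np.sign(raw) * np.log1p(abs(raw))
    return capped  # or return `compressed`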

Rizhiy commented 2 years ago

I might have an idea of why performance degrades. I think that if there are too many points being evaluated concurrently, the algorithm may propose a point very similar to one of those already in the queue. This essentially reduces the effective number of iterations. I will run a few experiments to check whether this is indeed the case.

Rizhiy commented 2 years ago

I've finally run the experiments and the results are as I expected. Here is the figure of samples with different numbers of workers:

[Figure: sample scatter plots for num_workers = 1, 2, 4, 8]

As can be seen, as the number of workers increases, the points start to clump together, effectively reducing the real number of iterations. Similarly, if we calculate the average distance to the nearest point, it decreases:

num_workers=1, mean_min_distance=0.346
num_workers=2, mean_min_distance=0.171
num_workers=4, mean_min_distance=0.158
num_workers=8, mean_min_distance=0.128

To fix this, I suggest that when a sample is suggested and put into the queue, the GP standard deviation at that point be set to 0, so that the optimizer doesn't suggest new points around that value. I guess this can be done by first registering a fake point and then replacing it with the real result, but I was wondering whether there is a better approach.

@bwheelz36 Any idea how this is best accomplished?
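
In the meantime, one workaround using only the public API would be to rebuild a throwaway optimizer that includes a "lie" value for each pending sample before asking for the next suggestion. A rough sketch (the helper name and argument structure are just illustrative, and this re-fits the GP from scratch on every call):

from bayes_opt.bayesian_optimization import BayesianOptimization

def suggest_with_pending(bounds, utility, observed, pending, lie=None):
    # observed: list of (params, target) pairs with real results
    # pending:  list of params dicts that are still being evaluated
    if lie is None:
        # pessimistic lie: pretend pending points return the worst value seen so far
        lie = min((target for _, target in observed), default=0.0)
    # throwaway optimizer, so the dummy points never pollute the real history
    helper = BayesianOptimization(f=None, pbounds=bounds, verbose=0)
    for params, target in observed:
        helper.register(params=params, target=target)
    for params in pending:
        helper.register(params=params, target=lie)
    return helper.suggest(utility)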

Rizhiy commented 2 years ago

Here is the code I used for the experiment, just in case:

from __future__ import annotations

import pickle
from typing import Callable

import matplotlib.pyplot as plt
import numpy as np
from bayes_opt.bayesian_optimization import BayesianOptimization
from bayes_opt.util import UtilityFunction
from scipy.optimize import rosen
from tqdm import tqdm

def _closest_distance(point, points):
    return min(np.linalg.norm(point - p) for p in points if p is not point)

def optimize(
    func: Callable[..., float], num_iter: int, bounds: dict[str, tuple[float, float]], num_workers=0
):
    init_samples = int(np.sqrt(num_iter))

    optimizer = BayesianOptimization(f=None, pbounds=bounds, verbose=0)
    init_kappa = 10
    kappa_decay = (0.1 / init_kappa) ** (1 / (num_iter - init_samples))
    utility = UtilityFunction(
        kind="ucb", kappa=init_kappa, xi=0.0, kappa_decay=kappa_decay, kappa_decay_delay=init_samples
    )

    init_queue = [optimizer.suggest(utility) for _ in range(init_samples)]
    result_queue = []
    tbar = tqdm(total=num_iter, leave=False)
    while len(optimizer.res) < num_iter:
        sample = init_queue.pop(0) if init_queue else optimizer.suggest(utility)
        loss = func(list(sample.values())) * -1
        result_queue.append((sample, loss))
        if len(result_queue) >= num_workers:
            try:
                optimizer.register(*result_queue.pop(0))
                utility.update_params()
                tbar.update()
            except KeyError:
                pass
    return optimizer.res

bounds = {"x": [-5, 5], "y": [-5, 5]}

all_results = {}
for num_workers in tqdm([1, 2, 4, 8, 16], desc="Checking num_workers"):
    results = optimize(rosen, 200, bounds, num_workers)
    samples = [res["params"] for res in results]
    all_results[num_workers] = samples
with open("results.pkl", "wb") as f:
    pickle.dump(all_results, f)
with open("results.pkl", "rb") as f:
    all_results = pickle.load(f)

fig, axs = plt.subplots(2, 2)
fig.set_figheight(8)
fig.set_figwidth(8)
axs = [item for sublist in axs for item in sublist]
for idx, (num_workers, samples) in enumerate(all_results.items()):
    if num_workers > 8:
        continue
    samples = [np.array(list(sample.values())) for sample in samples]
    axs[idx].scatter(*zip(*samples), s=1)
    axs[idx].set_title(f"{num_workers=}")
    avg_min_distance = np.mean([_closest_distance(sample, samples) for sample in samples])
    print(f"{num_workers=}, mean_min_distance={avg_min_distance:.3f}")
fig.tight_layout()
fig.savefig("results.png")

bwheelz36 commented 2 years ago

Hey, thanks for following this up, I'll reopen this issue.

Rizhiy commented 2 years ago

"I'm a bit surprised by your figure above. I had understood that, apart from the first n_worker points, this approach should function pretty normally, because a suggestion is never made until a previous point has been evaluated and registered... but this is clearly not what is happening; in the n_workers=8 case, it's not just the first 8 guesses that aren't very good, but also all subsequent guesses. What am I missing here? It looks more like the code is tending to produce clusters of similar points, which isn't what I would expect to happen..."

I think the problem is that when using ucb, there is likely a single point which maximizes the acquisition function. Unless a registered observation modifies the surface such that the best-UCB point changes, a similar point will keep being suggested. From my understanding, for an observation to have such an effect, its result must be close to the current best, which happens rarely.

"Re: setting std to zero (or at least a low number) - this would be one way to do it, but I would be a bit nervous to mess around with the GP like this, mostly because I don't understand the math well enough. The SD at any given point might also incorporate a noise kernel, for instance... I'm sure it can be done somewhat robustly, but it sounds less-than-ideal."

OK, so is it possible to "de-register" an observation? I can quite easily register a dummy observation when a sample is suggested and then de-register and re-register it once the actual result is available.

"A different approach might be to evaluate the acquisition function with different values of kappa, so that at each request you may get a highly exploitative or highly exploratory suggestion, or anything in between. We could for instance implement something like..."

Might be a good alternative approach, but I would like to try improving the quality of suggestions first.
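
For reference, a minimal sketch of what that multi-kappa idea could look like (the kappa range is arbitrary, and suggestions may still coincide if the acquisition surface is very peaked):

import numpy as np
from bayes_opt.util import UtilityFunction

def suggest_batch(optimizer, n_suggestions, kappa_min=1.0, kappa_max=10.0):
    # one suggestion per kappa, spanning exploitative (low kappa) to exploratory (high kappa)
    suggestions = []
    for kappa in np.linspace(kappa_min, kappa_max, n_suggestions):
        utility = UtilityFunction(kind="ucb", kappa=float(kappa), xi=0.0)
        suggestions.append(optimizer.suggest(utility))
    return suggestions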

Rizhiy commented 2 years ago

I looked at the code and it seems that it should be rather trivial to implement de-register/re-register functionality, since the GP is fitted anew for each suggestion. Will try to make an MR.

Rizhiy commented 2 years ago

I think I have a working solution: #365. Testing code:

from __future__ import annotations

import pickle
from typing import Callable

import matplotlib.pyplot as plt
import numpy as np
from bayes_opt.bayesian_optimization import BayesianOptimization
from bayes_opt.util import UtilityFunction
from scipy.optimize import rosen
from tqdm import tqdm

def _closest_distance(point, points):
    return min(np.linalg.norm(point - p) for p in points if p is not point)

def optimize(
    func: Callable[..., float], num_iter: int, bounds: dict[str, tuple[float, float]], num_workers=0, with_dummies=False
):
    init_samples = int(np.sqrt(num_iter))

    optimizer = BayesianOptimization(f=None, pbounds=bounds, verbose=0)
    init_kappa = 10
    kappa_decay = (0.1 / init_kappa) ** (1 / (num_iter - init_samples))
    utility = UtilityFunction(
        kind="ucb", kappa=init_kappa, xi=0.0, kappa_decay=kappa_decay, kappa_decay_delay=init_samples
    )

    init_queue = [optimizer.suggest(utility) for _ in range(init_samples)]
    result_queue = []
    tbar = tqdm(total=num_iter, leave=False)
    while len(optimizer.res) < num_iter:
        sample = init_queue.pop(0) if init_queue else optimizer.suggest(utility)
        if with_dummies and num_workers > 1:
            # register a dummy observation (register_dummy from #365) so the pending sample
            # suppresses new suggestions in its neighbourhood until the real result arrives
            optimizer.register_dummy(sample, -1)
        loss = func(list(sample.values())) * -1
        result_queue.append((sample, loss))
        if len(result_queue) >= num_workers:
            try:
                optimizer.register(*result_queue.pop(0))
                utility.update_params()
                tbar.update()
            except KeyError:
                pass
    return optimizer.res

bounds = {"x": [-5, 5], "y": [-5, 5]}

all_results = {}
for num_workers in tqdm([1, 2, 4, 8], desc="Checking num_workers"):
    results = optimize(rosen, 150, bounds, num_workers, with_dummies=True)
    all_results[num_workers] = results
with open("results.pkl", "wb") as f:
    pickle.dump(all_results, f)
with open("results.pkl", "rb") as f:
    all_results = pickle.load(f)

fig, axs = plt.subplots(2, 2)
fig.set_figheight(8)
fig.set_figwidth(8)
axs = [item for sublist in axs for item in sublist]
for idx, (num_workers, results) in enumerate(all_results.items()):
    if num_workers > 8:
        continue
    samples = [np.array(list(res["params"].values())) for res in results]
    axs[idx].scatter(*zip(*samples), s=1)
    axs[idx].set_title(f"{num_workers=}")
    avg_min_distance = np.mean([_closest_distance(sample, samples) for sample in samples])
    best_result = max([res["target"] for res in results])
    print(f"{num_workers=}, mean_min_distance={avg_min_distance:.3f}, {best_result=:.3f}")
fig.tight_layout()
fig.savefig("results.png")

Results without change (with_dummies=False):

num_workers=1, mean_min_distance=0.465, best_result=-0.005
num_workers=2, mean_min_distance=0.228, best_result=-0.008
num_workers=4, mean_min_distance=0.157, best_result=-0.654
num_workers=8, mean_min_distance=0.184, best_result=-0.974

[Figure: sample scatter plots for num_workers = 1, 2, 4, 8 (with_dummies=False)]

Results with change (with_dummies=True):

num_workers=1, mean_min_distance=0.471, best_result=-0.003
num_workers=2, mean_min_distance=0.483, best_result=-0.002
num_workers=4, mean_min_distance=0.514, best_result=-0.002
num_workers=8, mean_min_distance=0.506, best_result=-0.031

[Figure: sample scatter plots for num_workers = 1, 2, 4, 8 (with_dummies=True)]

As can be seen, there is quite a big improvement, although it seems there are still a couple of problems:

bwheelz36 commented 2 years ago

Looks really promising! Sorry, I think it is going to take me a few days to look at this properly though :-(

cestpasphoto commented 10 months ago

Wouldn't it be better to set the target of dummy points to what the regressor would expect? I mean, register_dummy() should use optimizer._gp.predict(params) as the target.
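
Roughly something like this (a sketch; it relies on the internal _gp and params_to_array helpers, which may differ between versions, plus the register_dummy method proposed in #365):

# use the GP's own mean prediction at the pending sample as the dummy target
x = optimizer._space.params_to_array(sample).reshape(1, -1)
predicted_target = optimizer._gp.predict(x)[0]
optimizer.register_dummy(sample, predicted_target)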

till-m commented 10 months ago

@cestpasphoto shameless plug, but in PR #447 you can find this under the name KrigingBeliever and the version proposed here under the name ConstantLiar. It's not merged currently, but you can install from my branch (if you do, please let me know if you have any feedback).

That being said, during my limited testing I found both methods to work well, though I read in a paper that the lying approach seems to perform better, if you manage to choose a suitable value for the lie.