EmuKit / emukit

A Python-based toolbox of various methods in decision making, uncertainty quantification and statistical emulation: multi-fidelity, experimental design, Bayesian optimisation, Bayesian quadrature, etc.
https://emukit.github.io/emukit/
Apache License 2.0

Curious behavior of `model.set_data()` and `control_loop.get_next_points()` #313

Open ekalosak opened 4 years ago

ekalosak commented 4 years ago

Hi Emukit team,

First, thank you for your work on this package - it's a joy to use.

I'm writing with a question about some curious behavior I've observed when using the Bayesian optimization control loop. When I use the IModel.set_data(X, Y) method to alter the model data and then call OuterLoop.get_next_points(results), the model's data is reset to what it was before the set_data() call, with an extra row appended for the contents of the results object.

The behavior I expected was that, after the OuterLoop.get_next_points(results) call, the model data would consist of the X passed to set_data() concatenated with the contents of results.


Here's a minimal example that reproduces the behavior:

import numpy as np

from GPy.models import GPRegression
from GPy.kern import Matern52

from emukit.bayesian_optimization.acquisitions import ExpectedImprovement
from emukit.bayesian_optimization.loops import BayesianOptimizationLoop
from emukit.core import (
    ParameterSpace,
    DiscreteParameter,
)
from emukit.core.loop import UserFunctionWrapper
from emukit.model_wrappers import GPyModelWrapper

# Initial observations
X = np.array([[1,1,2],[2,1,2],[1,1,1]])
Y = np.array([[1],[2],[3]])

# Surrogate optimization components
kernel = Matern52(
    input_dim=X.shape[1],
    )
model_gpy = GPRegression(
    X=X,
    Y=Y,
    kernel=kernel,
    normalizer=True,
    )
model_emukit = GPyModelWrapper(
    gpy_model = model_gpy,
    )
parameters = [DiscreteParameter(f'param_{i}', range(10)) for i in range(X.shape[1])]
parameter_space = ParameterSpace(parameters)
acquisition_criterion = ExpectedImprovement(model = model_emukit)
f = lambda x_row: np.array([[sum(sum(x_row))]])
f_wrapped = UserFunctionWrapper(f)

control_loop = BayesianOptimizationLoop(
    model = model_emukit,
    space = parameter_space,
    acquisition = acquisition_criterion,
    )

# Just make sure that the data is actually represented in the model
assert model_emukit.model.X.shape[0] == X.shape[0]

# Try to set the data using other matrices
X2 = np.array([[3,3,3],[3,4,3]])
Y2 = np.array([[4],[5]])
model_emukit.set_data(X=X2, Y=Y2)

# The data is 'set' after running set_data()
assert model_emukit.model.X.shape[0] == X2.shape[0]

# Provide a result for some arbitrarily suggested point
X_arbitrary_suggestion = np.array([[1,2,5]])
results = f_wrapped(X_arbitrary_suggestion)
X_next = control_loop.get_next_points(
    results = results,
    )

# As a side effect of control_loop.get_next_points(), the model data is reset.
assert model_emukit.model.X.shape[0] == X.shape[0] + 1
for model_x_row, initial_x_row in zip(model_emukit.model.X, X):
    assert all(model_x_row == initial_x_row)
ekalosak commented 4 years ago

For those encountering the same issue, here's a functional workaround:

from emukit.core.loop import LoopState, UserFunctionResult

# Rebuild the loop state from the replacement data so the model updater sees X2/Y2.
workaround_results = [UserFunctionResult(X=x, Y=y) for x, y in zip(X2, Y2)]
workaround_loop_state = LoopState(workaround_results)
control_loop.loop_state = workaround_loop_state

X_next = control_loop.get_next_points(
    results = results,
    )

# When the workaround is applied, model data are maintained as expected
assert model_emukit.model.X.shape[0] == X2.shape[0] + 1
for model_x_row, replacement_x_row in zip(model_emukit.model.X, X2):
    assert all(model_x_row == replacement_x_row)
mmahsereci commented 4 years ago

Hi @ekalosak, yes, the model is updated with the results that are stored in the loop state, at least if you use the appropriate model updater. Hence your workaround works in that case. For another (custom) model updater that does not use the loop state object, it may not work.
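
To make this concrete, here is a simplified, self-contained sketch of that mechanism. The Toy* classes below are made up for illustration and are not the emukit implementation, but the data flow (loop state feeding a model updater that calls set_data) mirrors, as far as I can tell, what the default FixedIntervalUpdater does:

import numpy as np

# Not emukit source code: a toy reconstruction of the update path, showing why
# set_data() calls made outside the loop state get overwritten.

class ToyLoopState:
    def __init__(self, X, Y):
        self.X, self.Y = X, Y

    def update(self, x_new, y_new):
        self.X = np.vstack([self.X, x_new])
        self.Y = np.vstack([self.Y, y_new])

class ToyModel:
    def set_data(self, X, Y):
        self.X, self.Y = X, Y

class ToyLoop:
    def __init__(self, model, X, Y):
        self.model = model
        self.loop_state = ToyLoopState(X, Y)
        model.set_data(X, Y)

    def get_next_points(self, x_new, y_new):
        # The updater pulls from the loop state, clobbering any outside set_data():
        self.loop_state.update(x_new, y_new)
        self.model.set_data(self.loop_state.X, self.loop_state.Y)

model = ToyModel()
loop = ToyLoop(model, np.ones((3, 2)), np.ones((3, 1)))
model.set_data(np.zeros((2, 2)), np.zeros((2, 1)))      # replace the data directly...
loop.get_next_points(np.ones((1, 2)), np.ones((1, 1)))  # ...but the loop state wins
assert model.X.shape[0] == 4                            # 3 original rows + 1 new row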

I am curious where you need that functionality, i.e., replacing the data of the model. I am asking because the active learning loop is usually used precisely to collect the data; if you replace it in the middle, you could have just started with that other data.

ekalosak commented 4 years ago

To address your curiosity: consider an experiment in which we have imprecise knowledge about the allowable discrete elements of the objective function's domain. What's more, we don't get the precise point in the domain associated with a particular experiment until some time after the primary experiment is performed.

An example might be a combinatorial materials science application where certain a priori unknown configurations of material properties are impossible to fabricate, some desired properties are only achievable approximately, etc. Our goal might be to improve conductivity, say, and conductivity is easy to test, so we get our measurements quickly post-fabrication. However, measuring the actually fabricated properties is difficult, takes time, and comes in batches because we send samples to an external lab in batches. Note that these material properties are part of the design space, not the objective function's co-domain.

It might be attractive to suggest multi-fidelity optimization, but doubling the number of free parameters seems problematic when we're shooting for sample-efficiency.

tl;dr: the X data are generated with incomplete information about the allowable domain, so it's useful to be able to adjust the model data as we go, as more precise information about the actually implemented X data becomes available.

ekalosak commented 4 years ago

To avoid getting too far off track: does it make sense for model.set_data() to be the definitive source of data? If so, I'm thinking of registering the loop_state as a paramz observer that updates whenever set_data() is called... but I'm really not sure what the best design is here, so any input would be appreciated.

Or perhaps the best design is simply to isolate data modification to the loop_state results and let control_loop.get_next_points() make the model.set_data() call.
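
For concreteness, here is a rough sketch of that second option; replace_loop_data is a hypothetical helper, not existing emukit API, and it assumes the objects from the snippets above:

from emukit.core.loop import LoopState, UserFunctionResult

def replace_loop_data(loop, X_new, Y_new):
    # Hypothetical helper: treat the loop state as the single source of truth.
    # The model updater pushes this data into the model via set_data() on the
    # next get_next_points() call.
    results = [UserFunctionResult(X=x, Y=y) for x, y in zip(X_new, Y_new)]
    loop.loop_state = LoopState(results)

# Usage, assuming control_loop, X2, Y2 and results from the snippets above:
# replace_loop_data(control_loop, X2, Y2)
# X_next = control_loop.get_next_points(results=results)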