emdgroup / baybe

Bayesian Optimization and Design of Experiments
https://emdgroup.github.io/baybe/
Apache License 2.0

DataFrame indexing for multi-task problems #297

Closed · sgbaird closed this 4 months ago

sgbaird commented 4 months ago

I'm guessing that for a multi-task problem, BayBE concatenates the dataframes such that some of the original indices are lost. See the reproducer below, which gives a recommended df with an index of 40, even though the highest index in lookup_test_task is 26. Note that the SMOKE_TEST environment variable was set to true when I ran this. Is this intended behavior?

If I dug further into the stack trace of simulate_scenarios, I'm guessing I'd find some handling of that, but it wasn't immediately obvious to me.

import os

import numpy as np
import pandas as pd

from botorch.test_functions.synthetic import Hartmann

from baybe import Campaign
from baybe.objectives import SingleTargetObjective
from baybe.parameters import NumericalDiscreteParameter, TaskParameter
from baybe.searchspace import SearchSpace
from baybe.targets import NumericalTarget
from baybe.utils.botorch_wrapper import botorch_function_wrapper

from baybe.utils.random import set_random_seed

set_random_seed(42)

# Configuration
SMOKE_TEST = "SMOKE_TEST" in os.environ
DIMENSION = 3
BATCH_SIZE = 1
N_MC_ITERATIONS = 2 if SMOKE_TEST else 50
N_DOE_ITERATIONS = 2 if SMOKE_TEST else 10
POINTS_PER_DIM = 3 if SMOKE_TEST else 5
NUM_INIT = 2 if SMOKE_TEST else 5

# Define objective and search space
objective = SingleTargetObjective(target=NumericalTarget(name="Target", mode="MIN"))
BOUNDS = Hartmann(dim=DIMENSION).bounds

discrete_params = [
    NumericalDiscreteParameter(
        name=f"x{d}",
        values=np.linspace(lower, upper, POINTS_PER_DIM),
    )
    for d, (lower, upper) in enumerate(BOUNDS.T)
]

task_param = TaskParameter(
    name="Function",
    values=["Test_Function", "Training_Function"],
    active_values=["Test_Function"],
)

parameters = [*discrete_params, task_param]
searchspace = SearchSpace.from_product(parameters=parameters)

# Define test functions
test_functions = {
    "Test_Function": botorch_function_wrapper(Hartmann(dim=DIMENSION)),
    "Training_Function": botorch_function_wrapper(
        Hartmann(dim=DIMENSION, negate=True, noise_std=0.15)
    ),
}

# Generate lookup tables
grid = np.meshgrid(*[p.values for p in discrete_params])
lookups = {}
for function_name, function in test_functions.items():
    lookup = pd.DataFrame({f"x{d}": grid_d.ravel() for d, grid_d in enumerate(grid)})
    lookup["Target"] = lookup.apply(function, axis=1)
    lookup["Function"] = function_name
    lookups[function_name] = lookup
lookup_training_task = lookups["Training_Function"]
lookup_test_task = lookups["Test_Function"]

# Perform the transfer learning campaign
campaign = Campaign(searchspace=searchspace, objective=objective)
initial_data = lookup_training_task.sample(n=NUM_INIT)

campaign.add_measurements(initial_data)

df = campaign.recommend(batch_size=BATCH_SIZE)

# print(df)
#         x0   x1   x2       Function
# index
# 40     1.0  0.0  1.0  Test_Function

# NOTE: Indices other than the ones from df are ignored
df["Target"] = lookup_test_task.iloc[df.index]["Target"]

campaign.add_measurements(df)

# NOTE: with SMOKE_TEST and 42 as seed:
# print(lookup_test_task)
#      x0   x1   x2    Target       Function
# 0   0.0  0.0  0.0 -0.067974  Test_Function
# 1   0.0  0.0  0.5 -0.136461  Test_Function
# 2   0.0  0.0  1.0 -0.091332  Test_Function
# 3   0.5  0.0  0.0 -0.097108  Test_Function
# 4   0.5  0.0  0.5 -0.185407  Test_Function
# 5   0.5  0.0  1.0 -0.090204  Test_Function
# 6   1.0  0.0  0.0 -0.030955  Test_Function
# 7   1.0  0.0  0.5 -0.072904  Test_Function
# 8   1.0  0.0  1.0 -0.084769  Test_Function
# 9   0.0  0.5  0.0 -0.018048  Test_Function
# 10  0.0  0.5  0.5 -0.839061  Test_Function
# 11  0.0  0.5  1.0 -1.994263  Test_Function
# 12  0.5  0.5  0.0 -0.025729  Test_Function
# 13  0.5  0.5  0.5 -0.628022  Test_Function
# 14  0.5  0.5  1.0 -1.957039  Test_Function
# 15  1.0  0.5  0.0 -0.008194  Test_Function
# 16  1.0  0.5  0.5 -0.225915  Test_Function
# 17  1.0  0.5  1.0 -1.826650  Test_Function
# 18  0.0  1.0  0.0 -0.000274  Test_Function
# 19  0.0  1.0  0.5 -2.262308  Test_Function
# 20  0.0  1.0  1.0 -0.334829  Test_Function
# 21  0.5  1.0  0.0 -0.000204  Test_Function
# 22  0.5  1.0  0.5 -1.485659  Test_Function
# 23  0.5  1.0  1.0 -0.325958  Test_Function
# 24  1.0  1.0  0.0 -0.000038  Test_Function
# 25  1.0  1.0  0.5 -0.224631  Test_Function
# 26  1.0  1.0  1.0 -0.300476  Test_Function
Exception has occurred: IndexError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
positional indexers are out-of-bounds
  File "C:\Users\sterg\miniforge3\envs\baybe\Lib\site-packages\pandas\core\indexing.py", line 1714, in _get_list_axis
    return self.obj._take_with_is_copy(key, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\sterg\miniforge3\envs\baybe\Lib\site-packages\pandas\core\generic.py", line 4153, in _take_with_is_copy
    result = self.take(indices=indices, axis=axis)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\sterg\miniforge3\envs\baybe\Lib\site-packages\pandas\core\generic.py", line 4133, in take
    new_data = self._mgr.take(
               ^^^^^^^^^^^^^^^
  File "C:\Users\sterg\miniforge3\envs\baybe\Lib\site-packages\pandas\core\internals\managers.py", line 891, in take
    indexer = maybe_convert_indices(indexer, n, verify=verify)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\sterg\miniforge3\envs\baybe\Lib\site-packages\pandas\core\indexers\utils.py", line 282, in maybe_convert_indices
    raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds

The above exception was the direct cause of the following exception:

  File "C:\Users\sterg\miniforge3\envs\baybe\Lib\site-packages\pandas\core\indexing.py", line 1717, in _get_list_axis
    raise IndexError("positional indexers are out-of-bounds") from err
  File "C:\Users\sterg\miniforge3\envs\baybe\Lib\site-packages\pandas\core\indexing.py", line 1743, in _getitem_axis
    return self._get_list_axis(key, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\sterg\miniforge3\envs\baybe\Lib\site-packages\pandas\core\indexing.py", line 1191, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\sterg\Documents\GitHub\sgbaird\baybe\examples\Transfer_Learning\index_reproducer.py", line 81, in <module>
    df["Target"] = lookup_test_task.iloc[df.index]["Target"]
                   ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "C:\Users\sterg\miniforge3\envs\baybe\Lib\runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "C:\Users\sterg\miniforge3\envs\baybe\Lib\runpy.py", line 198, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: positional indexers are out-of-bounds

xref: https://github.com/emdgroup/baybe/discussions/257 and https://github.com/emdgroup/baybe/discussions/283

NOTE: baybe==0.9.1.post209, Windows 11

sgbaird commented 4 months ago

Here is my workaround:

parameter_names = [p.name for p in searchspace.parameters]
obj_name = "Target"

full_lookup = pd.concat([lookup_training_task, lookup_test_task], ignore_index=True)

df = campaign.recommend(batch_size=BATCH_SIZE)

# update the dataframe with the target value(s)
df = pd.merge(
    df, full_lookup[parameter_names + [obj_name]], on=parameter_names, how="left"
)
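
Since the merge is a left join, any recommendation without an exact match in the lookup silently ends up with a NaN target. A small guard is worth adding (a hedged sketch, not part of the original workaround):

# A left merge fills unmatched recommendations with NaN instead of raising,
# so verify that every recommended configuration was found in the lookup.
if df[obj_name].isna().any():
    raise ValueError("Some recommendations have no matching row in the lookup.")
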
AdrianSosic commented 4 months ago

Hi @sgbaird, I think you might be mixing up unrelated things here. Note: your lookup_test_task merely holds the lookup data to close the DOE loop – it exists completely independently of your campaign and does not enter it in any way. They are only connected in the sense that you can use the lookup to look up the values that your campaign recommends, but the campaign isn't even aware that this object exists and could in fact do its work without it. Thus, there is no reason to assume that the recommendations of the campaign have anything in common with the lookup, e.g. that they would share indices or similar.

The indices you see returned by the campaign refer to the dataframe that is internally created to represent the discrete search space of the problem. But that is a completely arbitrary choice; in fact, I would even argue that the indices could be ignored entirely. We simply used the search space indices because pandas DataFrames have to carry an index, and this at least gives us a reference to which search space elements have been recommended (compared to the alternative where we would simply start enumerating from 1).
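
To make this concrete, you can inspect the internal candidate dataframe directly. A minimal sketch, assuming the discrete subspace exposes its experimental representation as searchspace.discrete.exp_rep:

# The recommendation index refers to rows of the internal search space
# dataframe, not to any user-side lookup table.
internal = campaign.searchspace.discrete.exp_rep
print(internal.loc[df.index])  # the search space rows that were recommended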

Does that answer your question?

sgbaird commented 4 months ago

Ah, got it. So my "workaround" above is actually the correct way to do it: look for a matching configuration.

I guess part of the confusion is that the lookup tables were created based on the allowed search space values, but I see that is an arbitrary choice for this example.

Let's take the case where the training data is sampled within a continuous search space and there isn't a particular pattern to the parameter sets. As a concrete example for a 1D problem:

training_data = [{"x": 0.43, "y": 1.86}, ... {"x": 0.78, "y": 2.3}]

If we then say that the test function can only be sampled at x = [0, 0.5, 1.0], what is the correct way to set this up with BayBE's API?

AdrianSosic commented 4 months ago

Setup

import pandas as pd

from baybe import Campaign
from baybe.parameters import NumericalContinuousParameter, NumericalDiscreteParameter
from baybe.recommenders import BotorchRecommender
from baybe.searchspace import SearchSpace
from baybe.targets import NumericalTarget

parameters = [
    NumericalDiscreteParameter("x", [0.0, 0.5, 1.0]),
    NumericalContinuousParameter("y", (0, 1)),
]
searchspace = SearchSpace.from_product(parameters)
objective = NumericalTarget("t", mode="MAX").to_objective()
recommender = BotorchRecommender()
measurements = pd.DataFrame.from_records(
    [
        {"x": 0.43, "y": 0.55, "t": 1.86},
        {"x": 0.78, "y": 0.98, "t": 2.3},
    ]
)

Using Recommender Directly

rec = recommender.recommend(5, searchspace, objective, measurements)
print(rec)

Via Campaign

campaign = Campaign(searchspace, objective, recommender)
campaign.add_measurements(measurements, numerical_measurements_must_be_within_tolerance=False)
rec = campaign.recommend(5)
print(rec)

Note: For the latter, you currently need to explicitly specify the numerical_measurements_must_be_within_tolerance flag, since your measurements strictly speaking lie outside the range of the parameter you specified. However, we are still working on this interface, exactly because the behavior is not yet perfectly consistent between the two approaches and because the "tolerance" logic needs to be revised in general. #workinprogress
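
To illustrate what the flag changes, here is a hedged sketch (the exact exception type raised by the validation is an assumption):

# Without the override, BayBE checks that each discrete parameter value lies
# within tolerance of one of the allowed values; x = 0.43 and x = 0.78 do
# not, so adding the measurements is rejected.
try:
    campaign.add_measurements(measurements)
except ValueError:  # exact exception type is an assumption
    print("Measurements rejected: values lie outside the parameter tolerance.")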

AdrianSosic commented 4 months ago

Hi @sgbaird, note that I've just updated the imports in my code above (which were a bit messy in the original version). Other than that, everything remains the same. I'll close the issue now, but feel free to reopen if further discussion is necessary ✌🏼