emdgroup / baybe

Bayesian Optimization and Design of Experiments
https://emdgroup.github.io/baybe/
Apache License 2.0

Recommendations taking a long time #192

Closed brandon-holt closed 2 months ago

brandon-holt commented 3 months ago

I'm wondering if the recommendation times I'm encountering are expected given my setup:

Machine: MacBook Air, 15 inch, M2, 2023
Memory: 16 GB
OS: Sonoma 14.4.1
Python: 3.11.8

Model: Single NumericalTarget
Parameters: 4 SubstanceParameters (~140 total SMILES molecules), 4 NumericalContinuousParameters
Constraints: 4 numerical parameters must sum to 1.0
Recommender: TwoPhaseMetaRecommender(initial_recommender=RandomRecommender(), recommender=SequentialGreedyRecommender())

So when I add 1000 datapoints via campaign.add_measurements() it takes ~4 days to make a recommendation with a batch size of 3. I started a test with only 10 datapoints and it is still running from overnight.

Does this sound expected given my machine, model, and data? If so, what would be the recommended ways to improve the speed? For the molecules I've tried with and without mordred & decorrelation, doesn't seem to make a big difference.

If this doesn't sound expected, how would you recommend I troubleshoot what could be causing the issue?

Thanks in advance!

Scienfitz commented 3 months ago

Hi @brandon-holt

The long runtime with 1000 data points could be caused by hitting the memory limit. I can imagine that on a 16 GB machine, as it might temporarily write data to disk (super slow). But it's weird if it also happens with only 10 added data points. Did you monitor memory after you started requesting recommendations?

There's also the possibility that you've constructed a gigantic search space (not in feature dimension but in number of combinations). Please share how you construct it. I'd also be interested in the dimensionality of your search space, obtained via campaign_object.searchspace.discrete.comp_rep and campaign_object.searchspace.discrete.exp_rep.

To investigate whether the surrogate model choice is causing this, you could use a random forest model, which scales better (shown here for an NGBoost model: https://emdgroup.github.io/baybe/examples/Custom_Surrogates/surrogate_params.html but you can easily change it to RandomForestSurrogate).

brandon-holt commented 3 months ago

Hi @Scienfitz thanks for the quick reply!

I suspect memory isn't the issue because memory stays far below my capacity (usually ~2 GB) throughout the duration of the recommendation.

I construct the searchspace like this: SearchSpace.from_product(parameters, constraints)

comp_rep produces a table with 43,740 rows x 194 columns exp_rep produces a table with 43,740 rows x 4 columns

Also, when I try to use a RandomForestSurrogate, I get the following error: NotImplementedError: Continuous search spaces are currently only supported by GPs.

Scienfitz commented 3 months ago

OK, I overlooked that you have a hybrid search space. In that case, random forest can't be used.

I can't see anything obvious, although 2 GB seems almost suspiciously low memory usage. Let's wait and see if the other devs have more ideas.

In the meantime, you could also try to model all parameters as discrete. For that, the from_simplex constructor can be more efficient than from_product.

Are you trying to model a mixture by any chance? We have a detailed example of how to do it with all-discrete parameters. If you want to do that with molecular representations, it can get complicated rather quickly due to the constraints. This is also one of the reasons why I suggest all-discrete parameters: some of the needed constraints are not yet supported between sets of mixed (i.e. both continuous and discrete) parameters.

AdrianSosic commented 3 months ago

Hi @brandon-holt, thanks for the report 🥇 I (am 99% confident that I) know exactly what causes the problem. I will compile a detailed explanation and share a few suggestions in the next hour, but wanted to briefly speak up so that you and @Scienfitz can stop searching for the cause. It has nothing to do with your memory or the fact that you are using GPs, but everything to do with the fact that you have a hybrid search space with a discrete part of non-trivial size, which makes the optimization routine explode 💥 (details to come ...)

AVHopp commented 3 months ago

I agree with @AdrianSosic that this is probably the reason. Won't go into much detail here until he has posted the more detailed explanation, but just wanted to confirm that this is probably the issue :)

AdrianSosic commented 3 months ago

Now here finally the explanation:

Problem

The "problem" with your setting is, as mentioned above, that you operate in a hybrid space and the applied optimizer simply does not work well when the discrete subspace is large. Our main workhorse, the SequentialGreedyRecommender, is really just a wrapper around what botorch ships for optimization. In hybrid spaces, this calls botorch's optimize_acqf_mixed, which unfortunately does not scale well. It runs a separate optimization of the continuous parameters for every possible configuration of your discrete parameters, for each of the points in your requested batch. That is, it doesn't perform a single optimization run but rather (size of discrete space) x (batch size) optimizations, which in your case is approx. 40,000 x 3 ~= 120,000 optimizations. This explains why it takes so utterly long.
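
To put rough numbers on this, here is a minimal back-of-the-envelope sketch. It uses the 43,740 rows reported for exp_rep earlier in the thread and the batch size of 3. The per-run time is only an illustration: it assumes the ~4 days are spent entirely in these optimizations, which was not measured.

```python
# Count the continuous optimizations triggered in the hybrid setting.
n_discrete_configs = 43_740  # rows of exp_rep reported earlier in the thread
batch_size = 3

# One continuous optimization per discrete configuration, per batch point
n_optimizations = n_discrete_configs * batch_size

# Assumption for illustration only: the whole ~4 days go into these runs
total_seconds = 4 * 24 * 60 * 60
seconds_per_optimization = total_seconds / n_optimizations

print(f"{n_optimizations:,} optimizations")
print(f"~{seconds_per_optimization:.1f} s per optimization")
```

Under that assumption, each individual optimization is fast on its own; it is the sheer count that produces the multi-day runtime.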

Solution

Unfortunately, optimizing hybrid spaces is notoriously hard and there is no easy solution – it's an active field of research. We've already started investigating more scalable approaches some time ago, but back then our code base was not yet ready for such advanced techniques. Today, the situation is different and we are already planning to continue our work on that end.

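In the meantime, I think there are only the following things you can do:

  • Reduce your problem size (not great, I know)
  • Use a continuous relaxation for your discrete variables (unfortunately not possible for substances and not built into baybe)
  • Implement an awesome alternative and push it via a PR (highly recommended :D)
  • Discretize your continuous variables.

The last one can be done very easily and we have a special convenience constructor for that. You'll find an example below. In this approach, you are only limited by your computer's memory size.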

Example

The following should roughly reproduce your setting. The discretized version takes about one minute on my machine. However, the discretization is rather crude and finer resolutions will quickly crash your memory. Also, the involved dataframe operations take quite some time but at least for this part there is already a fix on the horizon (we are planning to transition to polars soon ...)
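
To give a feeling for how quickly finer resolutions grow, here is a small counting sketch in plain Python (it does not call baybe). It assumes that the simplex constructor with boundary_only=True keeps exactly those grid points whose values sum to max_sum, and that every parameter uses the same evenly spaced grid on [0, 1]. The function names are made up for illustration.

```python
from itertools import product
from math import comb


def simplex_grid_size(n_levels: int, n_params: int, boundary_only: bool) -> int:
    """Count grid points whose coordinates sum to exactly 1 (or at most 1).

    Each parameter takes the values k / (n_levels - 1) for k = 0..n_levels-1,
    so this reduces to counting non-negative integer tuples with bounded sum.
    """
    units = n_levels - 1
    if boundary_only:
        # Stars and bars: tuples summing to exactly `units`
        return comb(units + n_params - 1, n_params - 1)
    # Tuples summing to at most `units` (one extra slack variable)
    return comb(units + n_params, n_params)


def brute_force_boundary(n_levels: int, n_params: int) -> int:
    """Cross-check by enumeration (only feasible for small grids)."""
    units = n_levels - 1
    grid = product(range(n_levels), repeat=n_params)
    return sum(1 for point in grid if sum(point) == units)


for n_levels in (5, 11, 21, 101):
    print(n_levels, simplex_grid_size(n_levels, n_params=4, boundary_only=True))
```

Each of these counts is still multiplied by the number of substance combinations from the product parameters, which is why even moderate resolutions can exhaust memory.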

Setup Code

import numpy as np

from baybe.campaign import Campaign
from baybe.constraints.continuous import ContinuousLinearInequalityConstraint
from baybe.objective import Objective
from baybe.parameters.numerical import (
    NumericalContinuousParameter,
    NumericalDiscreteParameter,
)
from baybe.parameters.substance import SubstanceParameter
from baybe.searchspace.core import SearchSpace
from baybe.searchspace.discrete import SubspaceDiscrete
from baybe.targets.numerical import NumericalTarget

substances = [
    "C", "CC", "CN", "CO", "CCC", "CCN", "CCO", "CNC", "COC", "CCNC", "CCOC", "CNCN",
    "COCN", "COCO", "CCCOC", "CCNCC", "CCOCO", "CNCCO", "CNCNC", "CNCOC", "COCCN",
    "CCCCOC", "CCNCNC", "CCNCOC", "CCOCOC", "CNCCOC", "CNCNCN", "CNCNCO", "CNCOCO",
    "COCCOC", "CCCNCOC", "CCNCCNC", "CCOCCOC", "CCOCNCO", "CNCNCNC", "CNCNCOC",
    "CNCOCCN", "CNCOCOC", "COCNCCN", "CCCCCCCO", "CCCOCOCO", "CCNCCNCO", "CCNCCOCN",
    "CCNCOCCO", "CCNCOCOC", "CCOCNCOC", "CNCCNCOC", "CNCNCCOC", "COCCCNCN",
    "COCCNCCO", "CCCCNCNCN", "CCCCOCOCN", "CCNCCNCOC", "CCOCCOCNC", "CNCNCNCOC",
    "CNCNCOCNC", "CNCNCOCOC", "CNCOCNCOC", "COCCNCNCN", "COCCOCNCO", "COCOCOCOC",
    "CCNCNCOCNC", "CCOCCCOCOC", "CNCCOCOCOC", "CNCNCNCCOC", "CNCNCOCCCN", "COCCOCNCOC",
    "CCCOCCNCCOC", "CCNCOCCCCNC", "CNCCCCNCOCN", "CNCNCOCOCCO", "CNCOCOCOCOC",
    "COCCCNCNCCN", "COCCNCCNCOC", "COCNCOCOCOC", "CCCCOCNCOCOC", "CCCOCNCOCOCC",
    "CCNCCCNCNCOC", "CCNCNCNCNCOC", "CNCCOCNCNCNC", "CNCCOCOCOCNC", "CNCNCNCNCOCO",
    "CNCOCOCNCNCN", "COCOCNCOCOCO", "CCNCNCCOCCCCO", "CNCCNCCOCNCNC", "CNCCNCNCNCOCO",
    "CNCCOCCOCOCOC", "COCOCCCOCOCCO", "COCOCCNCNCNCN", "CCNCOCNCOCOCNC",
    "CCOCCNCCNCNCOC", "CCOCCOCNCNCNCO", "CCOCNCCOCOCOCN", "CNCNCNCCCCOCOC",
    "COCNCCCNCNCOCN", "COCOCCCCNCCOCO", "CCCCCCNCCCCCNCC", "CCNCOCCOCCNCCNC",
    "CCOCCOCNCCOCCOC", "COCCNCNCNCOCCOC", "COCCNCOCCOCOCOC", "CCNCCCCNCCOCNCNC",
    "CCNCCNCNCCNCOCNC", "CCNCOCCNCOCOCCNC", "CNCCCCCCCNCNCCOC", "CNCCCCOCCOCCNCNC",
    "CNCNCNCOCCOCNCNC", "COCCCNCNCOCNCCOC", "COCNCCCOCNCNCCCN", "COCNCOCNCNCCCNCO",
    "CCCOCCCNCOCOCCCNC", "CCCOCNCNCNCOCOCOC", "CCNCCCNCNCNCCNCNC", "CCNCOCCNCOCCNCCNC",
    "CNCNCCNCOCNCCNCOC", "CNCNCCOCCCNCNCOCO", "CNCOCCNCCNCNCOCNC", "COCCCNCNCNCCOCNCN",
    "COCOCCCNCCOCCOCOC", "CCCCCNCOCOCNCCOCCC", "CCNCOCNCNCCNCCNCOC",
    "CCOCNCCNCNCNCCOCNC", "CNCCNCCCNCNCCCNCNC", "CNCCNCOCNCOCOCCNCN",
    "CNCOCNCNCNCOCOCCOC", "CNCOCOCOCCOCOCNCCO", "COCNCCCCOCNCNCOCOC",
    "COCOCOCNCNCOCNCCCN", "CCNCCNCNCCCCNCOCCCO", "CCNCNCCOCOCCOCCOCNC",
    "CNCNCNCOCOCOCOCCCOC", "CNCOCNCCCOCNCOCNCCN", "CNCOCOCNCCNCOCCCCOC",
    "COCNCCCOCOCOCCCNCCO", "CCNCCNCCCCOCOCNCCNCC", "CCOCNCOCCOCCCOCOCOCC",
    "CNCNCNCOCNCNCNCCNCOC", "CNCOCCCCCOCCOCCCOCOC", "COCCNCCCOCNCCOCNCCOC",
]  # fmt: skip

chunks = [substances[:10], substances[10:20], substances[20:24], substances[24:]]
substance_parameters = [
    SubstanceParameter(
        name=f"s_{i}",
        data={f"substance_{j}": substance for j, substance in enumerate(chunk)},
    )
    for i, chunk in enumerate(chunks)
]
targets = [NumericalTarget(name="target", mode="MAX")]
objective = Objective(mode="SINGLE", targets=targets)

Search Space: Hybrid Version (your approach)

continuous_parameters = [
    NumericalContinuousParameter(name=f"c_{i}", bounds=(0, 1)) for i in range(4)
]
parameters = substance_parameters + continuous_parameters
constraints = [
    ContinuousLinearInequalityConstraint(
        parameters=[p.name for p in continuous_parameters],
        coefficients=[1.0 for _ in continuous_parameters],
        rhs=1.0,
    )
]
searchspace = SearchSpace.from_product(parameters, constraints)

Search Space: Discretized Version

discrete_parameters = [
    NumericalDiscreteParameter(name=f"d_{i}", values=np.linspace(0, 1, 5))
    for i in range(4)
]
searchspace = SearchSpace(
    discrete=SubspaceDiscrete.from_simplex(
        max_sum=1.0,
        simplex_parameters=discrete_parameters,
        product_parameters=substance_parameters,
        boundary_only=True,
    )
)

Getting Recommendations

campaign = Campaign(searchspace, objective)
recommendations = campaign.recommend(10)
recommendations["target"] = np.random.random(len(recommendations))
campaign.add_measurements(recommendations)
campaign.recommend(3)

AdrianSosic commented 3 months ago

Ah, forgot one more thing. Of course, you can also try to fix the problem from other angles. In fact, we have two potential other solutions in our current code base, but be aware that they are not yet properly tested against realistic examples and they will probably give you very crude approximations:

But as I said, both approaches are rather experimental and I wouldn't consider them actual solutions to your problem ...

AVHopp commented 3 months ago

Thanks for this detailed explanation @AdrianSosic :) Just some additional note regarding the NaiveHybridSpaceRecommender: This recommender optimizes the two subspaces, i.e., the discrete and the continuous part, independently. It allows you to use different recommenders for the different subspaces, and works surprisingly well sometimes. Still, it is not really able to make use of potential dependencies between the two different subspaces (by design), and also has the problem mentioned previously.

AVHopp commented 3 months ago

Also, regarding the use of SequentialGreedyRecommender, here is a quick example of how to set the sampling_percentage:

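from baybe.recommenders import SequentialGreedyRecommender, TwoPhaseMetaRecommender

# Only optimize over a random 5% sample of the discrete configurations
seq_greedy_recommender = TwoPhaseMetaRecommender(
    recommender=SequentialGreedyRecommender(
        hybrid_sampler="Random", sampling_percentage=0.05
    ),
)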

You can choose between two hybrid_sampler options, namely Random and Farthest. Note, however, that you must not make the percentage too small: it is used internally to calculate an integer number of points, which can become 0.
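
As a rough illustration of that caveat, the sketch below shows how a percentage turns into an integer number of sampled configurations. The helper n_sampled_points is hypothetical, and the truncation via int() is an assumption; the exact rounding baybe applies internally may differ.

```python
def n_sampled_points(n_discrete_configs: int, sampling_percentage: float) -> int:
    """Hypothetical helper: number of discrete configurations that get sampled.

    Assumption: the product is truncated towards zero. This is NOT taken from
    the baybe source and only illustrates how the result can become 0.
    """
    return int(sampling_percentage * n_discrete_configs)


# Large discrete part (row count reported earlier in the thread)
large = n_sampled_points(43_740, 0.05)

# Small discrete part: the same percentage leaves nothing to optimize over
small = n_sampled_points(10, 0.05)

print(large, small)
```

With the sampled subset, the number of continuous optimizations per batch point drops from the full discrete size to the sampled count, at the cost of only seeing a fraction of the discrete space.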

brandon-holt commented 2 months ago

@Scienfitz @AdrianSosic @AVHopp Thank you all so much for this amazing responsiveness!! ❤️ These comments are all super helpful, and did help with the long run times. A couple of follow-up questions:

1) I'm curious if adding constraints will also improve the runtime (decreases dimensionality of the searchspace)?

2) Also, I'd like to move towards a fully continuous searchspace, or at least fully numerical. In this regime, what are the trade-offs of going fully discrete vs. continuous? Is one faster? In other words, if I have a mix of discrete and continuous numerical variables, should I discretize my continuous variables, or relax my discrete ones?

Scienfitz commented 2 months ago

@brandon-holt

  1. Adding constraints should usually not be a question of performance. If your scientific problem requires constraints, you should add them. If that's too slow, reduce your search space or upgrade the machine you run on. Adding a constraint is not something I would normally consider for that purpose, but I guess it could be a path. Due to the brute-force approach, your current setup performs 40k x 3 separate gradient optimizations, so any factor by which you can reduce the 40k will immediately reduce the runtime by that same factor.
  2. It's hard to give a concrete recommendation without knowing your problem and the parameters that are needed. In general, both approaches should work well, and the decision between them comes down to questions like: do you want to model categorical parameters, does the discrete variant of your search space explode in combinations, or can your experiment even set values precisely to the 4th decimal place? Relaxation to continuous variables is not supported by baybe itself, so I'd be a bit careful with that. If the resulting search space doesn't explode, full discretization will likely work best and is also the variant we tested most internally (with up to 15M search space rows).

brandon-holt commented 2 months ago

@Scienfitz Thank you, this is very helpful! Much appreciated

AdrianSosic commented 2 months ago

@brandon-holt, a few additional comments to question 2:

Discretization

Pros:

Cons:

Relaxation

Pros:

Cons:

brandon-holt commented 2 weeks ago

Now here finally the explanation:

Problem

The "problem" with your setting is, as mentioned above, that you operate in a hybrid space and the applied optimizer simply does not work well when the discrete subspace is large. Our main workhorse, the SequentialGreedyRecommender, is really just a wrapper around what botorch ships for optimization. In hybrid spaces, this calls botorch's optimize_acqf_mixed, which unfortunately does not scale well. It runs a separate optimization of the continuous parameters for every possible configuration of your discrete parameters, for each of the points in your requested batch. That is, it doesn't perform a single optimization run but rather (size of discrete space) x (batch size) optimizations, which in your case is approx. 40,000 x 3 ~= 120,000 optimizations. This explains why it takes so utterly long.

Solution

Unfortunately, optimizing hybrid spaces is notoriously hard and there is no easy solution – it's an active field of research. We've already started investigating more scalable approaches some time ago, but back then our code base was not yet ready for such advanced techniques. Today, the situation is different and we are already planning to continue our work on that end.

In the meantime, I think there are only the following things you can do:

  • Reduce your problem size (not great, I know)
  • Use a continuous relaxation for your discrete variables (unfortunately not possible for substances and not built into baybe)
  • Implement an awesome alternative and push it via a PR (highly recommended :D)
  • Discretize your continuous variables.

@Scienfitz @AdrianSosic Hi, following up on this: it worked very well for me with my original dataset. However, I am now trying to include an expanded set of 11 features, which again pushes the memory beyond what I have available, and I'm wondering if there's a better way to include these features. I'm attaching a spreadsheet with the 11 additional features to give you an idea of the complexity of the dataset. Is there something you see in here that would lend itself to a different way of constructing the search space? Is 11 new features really that much given the context?

new_features.csv

AdrianSosic commented 1 week ago

Hi @brandon-holt, my apologies, I saw your message on the weekend and then forgot about it on Monday, so it got lost 🙈

Can you quickly bring me up to speed again how exactly you currently try to create the search space based on this table? That is, if you have loaded the csv into a dataframe df, how do you attempt to create a search space object from df? Could you quickly share those few lines of code with me?

brandon-holt commented 6 days ago

@AdrianSosic Hey no worries, it happens! So this would be revising based on the "Search Space: Discretized Version" in the tagged comment:

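from copy import deepcopy

import pandas as pd

# Load the additional features; each column becomes one candidate parameter
new_features = pd.read_csv('new_features.csv')
updated_substance_parameters = deepcopy(substance_parameters)
for nf in new_features.columns:
    values = sorted(new_features[nf].unique())
    # Skip constant columns: a single value adds nothing to the search space
    if len(values) == 1:
        continue
    param = NumericalDiscreteParameter(name=nf, values=values)
    updated_substance_parameters.append(param)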

searchspace = SearchSpace(
    discrete=SubspaceDiscrete.from_simplex(
        max_sum=200,
        simplex_parameters=discrete_parameters,
        product_parameters=updated_substance_parameters,
        boundary_only=False,
    )
)

Basically, I'm just adding these new features as numerical discrete parameters to the substance parameters, which get slotted in as the product parameters when constructing the discrete search space from the simplex. (Excuse the inaccurate name: I'm aware that once these numerical discrete features are added, the updated_substance_parameters list contains a mix of substance parameters and numerical discrete parameters.)

AdrianSosic commented 1 day ago

Hi @brandon-holt, finally had some time to look into it (the workload is a bit heavy these days 😬).

To answer your question whether "adding the features is really that much": Have a look at our new helper method SubspaceDiscrete.estimate_product_space_size. It gives me:

Here you can witness what the term "exponential explosion" really means 🙃. So building that product space is just not possible; we need to find a different approach. Is a discrete product search space really what you want/need?
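
For intuition on how a product space grows, here is a plain-Python sketch that does not call the baybe helper mentioned above. All numbers in it are hypothetical: the real cardinalities depend on the unique values per column in new_features.csv, and the base row count merely reuses the figure reported earlier in the thread as a placeholder.

```python
from math import prod

# Placeholder: discrete row count reported earlier in the thread
base_rows = 43_740

# Hypothetical: 11 new features with 5 unique values each
new_feature_cardinalities = [5] * 11

# A product space multiplies the cardinalities of all its parameters
extra_factor = prod(new_feature_cardinalities)
total_rows = base_rows * extra_factor

# Lower bound on memory: one float64 (8 bytes) per row for a single column
min_bytes = total_rows * 8

print(f"growth factor: {extra_factor:,}")
print(f"total rows:    {total_rows:,}")
print(f"at least {min_bytes / 1e12:.1f} TB for a single column")
```

Even with these modest made-up cardinalities, every additional feature multiplies the row count, so the table cannot be materialized regardless of how efficient the dataframe operations are.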