Closed brandon-holt closed 2 months ago
Hi @brandon-holt
The long runtime with 1000 data points could be caused by running into the memory limit; I can imagine that happening on a 16GB machine, as it might write data temporarily to disk (super slow). But it's weird that it also happens with only 10 added data points. Did you monitor memory after you started requesting recommendations?
There's also the possibility that you've constructed a gigantic searchspace (not in feature dimension but in number of combinations). Please provide the way you construct it. I'd also be interested in the dimensionality of your searchspace, obtained via campaign_object.searchspace.discrete.comp_rep and campaign_object.searchspace.discrete.exp_rep
To investigate whether the surrogate model choice is causing this, you could use a more scaling-friendly random forest model (here done for an ngboost model: https://emdgroup.github.io/baybe/examples/Custom_Surrogates/surrogate_params.html, but you can easily change it to RandomForestSurrogate).
Hi @Scienfitz thanks for the quick reply!
I suspect memory isn't the issue because memory stays far below my capacity (usually ~2 GB) throughout the duration of the recommendation.
I construct the searchspace like this:
SearchSpace.from_product(parameters, constraints)
comp_rep produces a table with 43,740 rows x 194 columns
exp_rep produces a table with 43,740 rows x 4 columns
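As a rough sanity check (back-of-the-envelope, assuming float64 entries), the comp_rep table alone is nowhere near my 16 GB of memory:

```python
# Rough in-memory size of the computational representation,
# assuming one float64 (8 bytes) per entry of the 43,740 x 194 table.
rows, cols = 43_740, 194
size_bytes = rows * cols * 8
print(f"{size_bytes / 1e6:.0f} MB")  # ~68 MB, far below the 16 GB available
```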
Also, when I try to use a RandomForestSurrogate, I get the following error: NotImplementedError: Continuous search spaces are currently only supported by GPs.
Ok, I overlooked that you have a hybrid search space; in that case the random forest can't be used.
I can't see anything obvious, although 2GB seems almost suspiciously low memory usage. Let's wait whether the other devs have more ideas.
In the meantime, you could also try to model all parameters as discrete. For that, the from_simplex constructor (instead of from_product) can be efficient.
Are you trying to model a mixture by any chance? We have a detailed example of how to do that with all-discrete parameters. If you want to do it with molecular representations, it can get complicated rather quickly due to the constraints. This is also one of the reasons why I suggest all-discrete parameters: some of the needed constraints are not yet supported between sets of mixed (i.e., both continuous and discrete) parameters.
Hi @brandon-holt, thanks for the report! I (am 99% confident that I) know exactly what causes the problem. I will compile a detailed explanation and share a few suggestions in the next hour, but wanted to briefly speak up so that you and @Scienfitz can stop searching for the cause. It has nothing to do with your memory or the fact you are using GPs, but all with the fact that you have a hybrid search space with a discrete part of non-trivial size, which makes the used optimization routine explode (details to come ...)
I agree with @AdrianSosic that this is probably the reason. Won't go into much detail here until he has posted the more detailed explanation, but just wanted to confirm that this is probably the issue :)
Now here finally the explanation:
Problem

The "problem" with your setting is, as mentioned above, that you operate in a hybrid space, and the applied optimizer simply does not work well in situations where the discrete subspace is large. Our main workhorse, the SequentialGreedyRecommender, is really just a wrapper around what botorch ships for optimization. In hybrid spaces, this calls botorch's optimize_acqf_mixed, which unfortunately does not scale well. What it really does is run a separate optimization of the continuous parameters for all possible configurations of your discrete parameters, for each of the points in your requested batch. That is, it doesn't perform a single optimization run but rather (size of discrete space) x (batch size) optimizations, which in our case is approx. 40,000 x 3 ~= 120,000 optimizations. So this explains why it takes so utterly long.
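As a quick sanity check of these numbers (using the comp_rep row count reported earlier as the size of the discrete subspace):

```python
# Each point in the requested batch triggers one continuous optimization
# per discrete configuration, so the counts simply multiply.
discrete_size = 43_740  # rows of the discrete subspace (from comp_rep above)
batch_size = 3
total_optimizations = discrete_size * batch_size
print(f"{total_optimizations:,}")  # 131,220 separate optimizations, i.e. ~40,000 x 3
```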
Solution

Unfortunately, optimizing hybrid spaces is notoriously hard and there is no easy solution; it's an active field of research. We already started investigating more scalable approaches some time ago, but back then our code base was not yet ready for such advanced techniques. Today, the situation is different and we are already planning to continue our work on that end.
In the meantime, I think there are only the following things you can do:
- Reduce your problem size (not great, I know)
- Use a continuous relaxation for your discrete variables (unfortunately not possible for substances and not built into baybe)
- Implement an awesome alternative and push it via a PR (highly recommended :D)
- Discretize your continuous variables.
The last one can be done very easily and we have a special convenience constructor for that. You'll find an example below. In this approach, you are only limited by your computer's memory size.
Example

The following should roughly reproduce your setting. The discretized version takes about one minute on my machine. However, the discretization is rather crude and finer resolutions will quickly crash your memory. Also, the involved dataframe operations take quite some time, but at least for this part there is already a fix on the horizon (we are planning to transition to polars soon ...)
Setup Code

import numpy as np
from baybe.campaign import Campaign
from baybe.constraints.continuous import ContinuousLinearInequalityConstraint
from baybe.objective import Objective
from baybe.parameters.numerical import (
NumericalContinuousParameter,
NumericalDiscreteParameter,
)
from baybe.parameters.substance import SubstanceParameter
from baybe.searchspace.core import SearchSpace
from baybe.searchspace.discrete import SubspaceDiscrete
from baybe.targets.numerical import NumericalTarget
substances = [
"C", "CC", "CN", "CO", "CCC", "CCN", "CCO", "CNC", "COC", "CCNC", "CCOC", "CNCN",
"COCN", "COCO", "CCCOC", "CCNCC", "CCOCO", "CNCCO", "CNCNC", "CNCOC", "COCCN",
"CCCCOC", "CCNCNC", "CCNCOC", "CCOCOC", "CNCCOC", "CNCNCN", "CNCNCO", "CNCOCO",
"COCCOC", "CCCNCOC", "CCNCCNC", "CCOCCOC", "CCOCNCO", "CNCNCNC", "CNCNCOC",
"CNCOCCN", "CNCOCOC", "COCNCCN", "CCCCCCCO", "CCCOCOCO", "CCNCCNCO", "CCNCCOCN",
"CCNCOCCO", "CCNCOCOC", "CCOCNCOC", "CNCCNCOC", "CNCNCCOC", "COCCCNCN",
"COCCNCCO", "CCCCNCNCN", "CCCCOCOCN", "CCNCCNCOC", "CCOCCOCNC", "CNCNCNCOC",
"CNCNCOCNC", "CNCNCOCOC", "CNCOCNCOC", "COCCNCNCN", "COCCOCNCO", "COCOCOCOC",
"CCNCNCOCNC", "CCOCCCOCOC", "CNCCOCOCOC", "CNCNCNCCOC", "CNCNCOCCCN", "COCCOCNCOC",
"CCCOCCNCCOC", "CCNCOCCCCNC", "CNCCCCNCOCN", "CNCNCOCOCCO", "CNCOCOCOCOC",
"COCCCNCNCCN", "COCCNCCNCOC", "COCNCOCOCOC", "CCCCOCNCOCOC", "CCCOCNCOCOCC",
"CCNCCCNCNCOC", "CCNCNCNCNCOC", "CNCCOCNCNCNC", "CNCCOCOCOCNC", "CNCNCNCNCOCO",
"CNCOCOCNCNCN", "COCOCNCOCOCO", "CCNCNCCOCCCCO", "CNCCNCCOCNCNC", "CNCCNCNCNCOCO",
"CNCCOCCOCOCOC", "COCOCCCOCOCCO", "COCOCCNCNCNCN", "CCNCOCNCOCOCNC",
"CCOCCNCCNCNCOC", "CCOCCOCNCNCNCO", "CCOCNCCOCOCOCN", "CNCNCNCCCCOCOC",
"COCNCCCNCNCOCN", "COCOCCCCNCCOCO", "CCCCCCNCCCCCNCC", "CCNCOCCOCCNCCNC",
"CCOCCOCNCCOCCOC", "COCCNCNCNCOCCOC", "COCCNCOCCOCOCOC", "CCNCCCCNCCOCNCNC",
"CCNCCNCNCCNCOCNC", "CCNCOCCNCOCOCCNC", "CNCCCCCCCNCNCCOC", "CNCCCCOCCOCCNCNC",
"CNCNCNCOCCOCNCNC", "COCCCNCNCOCNCCOC", "COCNCCCOCNCNCCCN", "COCNCOCNCNCCCNCO",
"CCCOCCCNCOCOCCCNC", "CCCOCNCNCNCOCOCOC", "CCNCCCNCNCNCCNCNC", "CCNCOCCNCOCCNCCNC",
"CNCNCCNCOCNCCNCOC", "CNCNCCOCCCNCNCOCO", "CNCOCCNCCNCNCOCNC", "COCCCNCNCNCCOCNCN",
"COCOCCCNCCOCCOCOC", "CCCCCNCOCOCNCCOCCC", "CCNCOCNCNCCNCCNCOC",
"CCOCNCCNCNCNCCOCNC", "CNCCNCCCNCNCCCNCNC", "CNCCNCOCNCOCOCCNCN",
"CNCOCNCNCNCOCOCCOC", "CNCOCOCOCCOCOCNCCO", "COCNCCCCOCNCNCOCOC",
"COCOCOCNCNCOCNCCCN", "CCNCCNCNCCCCNCOCCCO", "CCNCNCCOCOCCOCCOCNC",
"CNCNCNCOCOCOCOCCCOC", "CNCOCNCCCOCNCOCNCCN", "CNCOCOCNCCNCOCCCCOC",
"COCNCCCOCOCOCCCNCCO", "CCNCCNCCCCOCOCNCCNCC", "CCOCNCOCCOCCCOCOCOCC",
"CNCNCNCOCNCNCNCCNCOC", "CNCOCCCCCOCCOCCCOCOC", "COCCNCCCOCNCCOCNCCOC",
] # fmt: skip
chunks = [substances[:10], substances[10:20], substances[20:24], substances[24:]]
substance_parameters = [
SubstanceParameter(
name=f"s_{i}",
data={f"substance_{j}": substance for j, substance in enumerate(chunk)},
)
for i, chunk in enumerate(chunks)
]
targets = [NumericalTarget(name="target", mode="MAX")]
objective = Objective(mode="SINGLE", targets=targets)
Search Space: Hybrid Version (your approach)

continuous_parameters = [
NumericalContinuousParameter(name=f"c_{i}", bounds=(0, 1)) for i in range(4)
]
parameters = substance_parameters + continuous_parameters
constraints = [
ContinuousLinearInequalityConstraint(
parameters=[p.name for p in continuous_parameters],
coefficients=[1.0 for _ in continuous_parameters],
rhs=1.0,
)
]
searchspace = SearchSpace.from_product(parameters, constraints)
Search Space: Discretized Version

discrete_parameters = [
NumericalDiscreteParameter(name=f"d_{i}", values=np.linspace(0, 1, 5))
for i in range(4)
]
searchspace = SearchSpace(
discrete=SubspaceDiscrete.from_simplex(
max_sum=1.0,
simplex_parameters=discrete_parameters,
product_parameters=substance_parameters,
boundary_only=True,
)
)
Getting Recommendations

campaign = Campaign(searchspace, objective)
recommendations = campaign.recommend(10)
recommendations["target"] = np.random.random(len(recommendations))
campaign.add_measurements(recommendations)
campaign.recommend(3)
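To get a feeling for why the discretized version stays manageable, here is a small stand-alone sketch (pure itertools/NumPy, no baybe required) that counts the grid points the numerical part keeps when only combinations summing exactly to max_sum survive, which is what boundary_only=True amounts to:

```python
import itertools

import numpy as np

values = np.linspace(0, 1, 5)  # the grid used for each d_i above

# Keep only combinations lying exactly on the simplex boundary,
# i.e. those whose coordinates sum to max_sum (here 1.0).
on_boundary = [
    combo
    for combo in itertools.product(values, repeat=4)
    if np.isclose(sum(combo), 1.0)
]
print(len(on_boundary))  # 35 combinations instead of 5**4 = 625 from a full product
```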
Ah, forgot one more thing. Of course, you can also try to fix the problem from other angles. In fact, we have two potential other solutions in our current code base, but be aware that they are not yet properly tested against realistic examples and they will probably give you very crude approximations:
- Reduce the number of discrete configurations that optimize_acqf_mixed considers for the optimization. This can be done by passing the optional sampling_percentage argument to the SequentialGreedyRecommender.
- Use the NaiveHybridSpaceRecommender instead, which doesn't have scaling problems but will return recommendations that are not batch-optimized (i.e. each point from the batch is optimized separately from the rest).
But as I said, both approaches are rather experimental and I wouldn't consider them actual solutions to your problem ...
Thanks for this detailed explanation @AdrianSosic :) Just an additional note regarding the NaiveHybridSpaceRecommender: This recommender optimizes the two subspaces, i.e., the discrete and the continuous part, independently. It allows you to use different recommenders for the different subspaces, and works surprisingly well sometimes. Still, it is not really able to make use of potential dependencies between the two different subspaces (by design), and it also has the problem mentioned previously.
Also, regarding the use of SequentialGreedyRecommender, here is a quick example of how to set the sampling_percentage:
# Import paths may vary slightly with your baybe version
from baybe.recommenders import SequentialGreedyRecommender, TwoPhaseMetaRecommender

seq_greedy_recommender = TwoPhaseMetaRecommender(
recommender=SequentialGreedyRecommender(
hybrid_sampler="Random", sampling_percentage=0.05
),
)
You can choose between two different hybrid_sampler options, namely Random and Farthest.
Note however that you must not make the percentage too small, as it is used internally to calculate an integer number of points, which can become 0.
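To illustrate the warning: assuming the percentage is internally converted to an integer point count roughly via a truncating multiplication (a hypothetical sketch, not the actual baybe code), a too-small percentage leaves no discrete points at all:

```python
# Hypothetical illustration of how a sampling percentage can collapse to zero points.
discrete_size = 43_740  # size of the discrete subspace from the example above

for pct in (0.05, 0.0001, 0.00001):
    n_points = int(pct * discrete_size)  # int() truncates towards zero
    print(pct, "->", n_points)
# 0.05    -> 2187 (fine)
# 0.0001  -> 4    (still fine)
# 0.00001 -> 0    (no discrete points left to optimize over!)
```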
@Scienfitz @AdrianSosic @AVHopp Thank you all so much for this amazing responsiveness!! These comments are all super helpful, and did help with the long run times. A couple of follow-up questions:
1) I'm curious whether adding constraints will also improve the runtime (since it reduces the number of combinations in the searchspace)?
2) Also, I'd like to move towards a fully continuous searchspace, or at least fully numerical. In this regime, what are the trade-offs of going fully discrete vs. continuous? Is one faster? In other words, if I have a mix of discrete and continuous numerical variables, should I discretize my continuous variables, or relax my discrete ones?
@brandon-holt That depends on questions like: do you want to model categorical parameters? Does the discrete variant of your searchspace explode in combinations? Can your experiment even set values precisely to the 4th decimal place? Relaxation of continuous variables is not supported by baybe itself, so I'd be a bit careful with that. If the resulting searchspace doesn't explode, full discretization will likely work best and is also the variant we tested most internally (with up to 15M searchspace rows).

@Scienfitz Thank you, this is very helpful! Much appreciated
@brandon-holt, a few additional comments to question 2:
@Scienfitz @AdrianSosic Hi, following up on this: this worked very well for me with my original dataset. However, I am now trying to include an expanded set of 11 features that is pushing the memory over the brink of what I have available again, and I'm wondering if there's a better way to include these features. I'm attaching a spreadsheet of the 11 additional features to give you an idea of the complexity of the dataset. Is there something you see in here that would lend itself to a different way of constructing the searchspace? Is 11 new features really that much, given the context?
Hi @brandon-holt, my apologies, I saw your message on the weekend and then forgot on Monday, so it got lost.
Can you quickly bring me up to speed again on how exactly you currently try to create the search space based on this table? That is, if you have loaded the csv into a dataframe df, how do you attempt to create a search space object from df? Could you quickly share those few lines of code with me?
@AdrianSosic Hey no worries, it happens! So this would be revising based on the "Search Space: Discretized Version" in the tagged comment:
import pandas as pd
from copy import deepcopy

new_features = pd.read_csv('new_features.csv')
updated_substance_parameters = deepcopy(substance_parameters)
for nf in new_features.columns:
    values = sorted(new_features[nf].unique())
    if len(values) == 1:
        continue  # skip constant features
    param = NumericalDiscreteParameter(name=nf, values=values)
    updated_substance_parameters.append(param)
searchspace = SearchSpace(
discrete=SubspaceDiscrete.from_simplex(
max_sum=200,
simplex_parameters=discrete_parameters,
product_parameters=updated_substance_parameters,
boundary_only=False,
)
)
Basically, I'm just adding these new features as numerical discrete parameters alongside the substance parameters, which together get slotted in as the product parameters when constructing the discrete searchspace from the simplex. (Excuse the inaccurate name: I'm aware that once we add these numerical discrete features, the updated_substance_parameters list contains a mix of substance parameters and numerical discrete parameters.)
Hi @brandon-holt, finally had some time to look into it (these days the workload is a bit heavy).
To answer your question whether "adding the features is really that much": Have a look at our new helper method SubspaceDiscrete.estimate_product_space_size. It gives me:
- SubspaceDiscrete.estimate_product_space_size(substance_parameters).exp_rep_human_readable: ~12 MB
- SubspaceDiscrete.estimate_product_space_size(updated_substance_parameters).exp_rep_human_readable: ~10 terabytes
Here you can witness what the term "exponential explosion" really means. So building that product space is just not possible; we need to find a different approach. Is a discrete product search space really what you want/need?
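The blow-up is easy to see with plain arithmetic: a product space has as many rows as the product of the per-parameter value counts, so every added feature multiplies the row count by its cardinality. A small sketch (the chunk sizes match the earlier example, but the extra-feature cardinalities are purely hypothetical illustration values, not your actual data):

```python
import math

# Hypothetical per-parameter value counts for illustration:
substance_counts = [10, 10, 4, 116]          # chunk sizes from the earlier example
n_rows_before = math.prod(substance_counts)  # product space of the substances alone

# Appending extra discrete features multiplies the row count by each cardinality.
extra_feature_counts = [11] * 11             # e.g. 11 features, 11 unique values each
n_rows_after = n_rows_before * math.prod(extra_feature_counts)

print(f"{n_rows_before:,}")  # 46,400
print(f"{n_rows_after:,}")   # ~1.3e16 rows: the "exponential explosion"
```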
I'm wondering if the recommendation times I'm encountering are expected given my setup:
Machine: MacBook Air, 15 inch, M2, 2023
Memory: 16 GB
OS: Sonoma 14.4.1
Python: 3.11.8
Model: Single NumericalTarget
Parameters: 4 SubstanceParameters (~140 total SMILES molecules), 4 NumericalContinuousParameters
Constraints: 4 numerical parameters must sum to 1.0
Recommender: TwoPhaseMetaRecommender(initial_recommender=RandomRecommender(), recommender=SequentialGreedyRecommender())
So when I add 1000 datapoints via campaign.add_measurements() it takes ~4 days to make a recommendation with a batch size of 3. I started a test with only 10 datapoints and it is still running from overnight.
Does this sound expected given my machine, model, and data? If so, what would be the recommended ways to improve the speed? For the molecules I've tried with and without mordred & decorrelation, doesn't seem to make a big difference.
If this doesn't sound expected, how would you recommend I troubleshoot what could be causing the issue?
Thanks in advance!