emdgroup / baybe

Bayesian Optimization and Design of Experiments
https://emdgroup.github.io/baybe/
Apache License 2.0

Estimate shape of search space? #223

Closed monken closed 4 months ago

monken commented 4 months ago

Is there a way to calculate an upper bound for the shape of the search space? I'm trying to find a way to give the user feedback on the complexity of the search space and also how much memory needs to be allocated to proceed with the campaign and generate experiments.

Scienfitz commented 4 months ago

@monken Unfortunately, this is not very straightforward in a general sense, but for simple combinations you can estimate something.

We generally distinguish two big memory bottlenecks: 1) the creation and storage of the search space, and 2) the optimization itself, which is triggered by recommend.

We can help estimate 1), but 2) is trickier and I won't tackle it here.

A discrete search space is basically just the Cartesian product of all parameters and their values, but encodings and constraints can complicate things. In what we call the computational representation, the search space is stored as a matrix whose rows are all combinations of parameter values and whose columns are the columns contributed by each parameter's computational representation. In general, this means you need A * B floating point numbers, where A is the number of parameter-value combinations and B is the total number of columns needed to represent all parameters computationally.

If you have only simple parameters
Simple parameters are NumericalDiscreteParameter and CategoricalParameter. If parameter1 has M_1 values and parameter2 has M_2 values, the total number of combinations is M_1 * M_2. Since each simple parameter has 1 column in what we call the computational representation, you will have 2 columns representing the numbers that go into the ML algorithm. So the total number of floats this matrix holds is M_1 * M_2 * 2, or in general M_1 * M_2 * ... * M_N * N floats, where N is the number of parameters.
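For illustration, a minimal sketch of this count for two simple parameters (assuming the discrete subspace exposes its computational representation via the comp_rep dataframe, as in current baybe versions):

from baybe.parameters import NumericalDiscreteParameter
from baybe.searchspace import SearchSpace

parameters = [
    NumericalDiscreteParameter(name="temperature", values=[10, 20, 30]),  # M_1 = 3
    NumericalDiscreteParameter(name="pressure", values=[1, 2]),           # M_2 = 2
]
searchspace = SearchSpace.from_product(parameters)

# rows = M_1 * M_2 = 6 combinations, columns = N = 2 simple parameters
print(searchspace.discrete.comp_rep.shape)  # (6, 2) -> 6 * 2 = 12 floats in total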

Unfortunately, this only holds for some settings, e.g. the NumericalDiscreteParameter and the CategoricalParameter in INT encoding. But the CategoricalParameter can also have OHE encoding, where the parameter is not represented by 1 column but by M columns, where M is again the number of possible values the parameter has. So instead of counting 1 column, you have to add M columns. In the above example, if our first parameter were a numerical discrete one and the second one a categorical one with OHE encoding, we'd get M_1 * M_2 * (1 + M_2) floats.
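As a small illustration of the encoding effect (using the comp_df attribute discussed further down):

from baybe.parameters import CategoricalParameter

cat_int = CategoricalParameter(name="c_int", values=["A", "B", "C"], encoding="INT")
cat_ohe = CategoricalParameter(name="c_ohe", values=["A", "B", "C"], encoding="OHE")

print(cat_int.comp_df.shape)  # (3, 1): a single column in INT encoding
print(cat_ohe.comp_df.shape)  # (3, 3): M columns in OHE encoding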

If you have specially encoded parameters
This is where it gets tricky: for CustomParameter and SubstanceParameter, the number of columns in the computational representation cannot be inferred a priori. It depends on the values, e.g. the molecules that are encoded. If they are all similar, fewer columns are needed; if they differ a lot, more columns are needed.

This complicates the approach where the total number of floats in the matrix is estimated. One way to find out how many columns are needed for these parameters: if you instantiate all parameter objects (without creating a search space from them), you have access to the my_parameter.comp_df dataframe. Its .shape then gives the exact number of values this parameter can take (first dimension) and the number of columns they are represented with (second dimension). This leads me to:
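For example (a sketch assuming a baybe installation with the chemistry extras required by SubstanceParameter):

from baybe.parameters import SubstanceParameter

solvent = SubstanceParameter(name="solvent", data={"water": "O", "ethanol": "CCO"})
n_values, n_columns = solvent.comp_df.shape  # column count is only known after encoding the molecules
print(n_values, n_columns)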

Discrete Space Estimate Formula
Following the logic above, if we have n parameters, where the i-th parameter has M_i possible values and C_i columns in its computational representation, we get the following upper bound for the number of floating point numbers needed to represent the search space:

A = M_1 * M_2 * ... * M_n
B = C_1 + C_2 + ... + C_n
float_numbers_needed = A * B
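A hypothetical helper implementing this formula could look as follows (the name and function are illustrative, not part of baybe):

def estimate_discrete_float_count(parameters):
    """Upper bound on the number of floats in the computational representation."""
    A = 1  # number of parameter-value combinations
    B = 0  # total number of comp-rep columns
    for p in parameters:
        n_values, n_columns = p.comp_df.shape
        A *= n_values
        B += n_columns
    return A * B

# With float64 (8 bytes per number), the memory estimate in bytes would be
# estimate_discrete_float_count(parameters) * 8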

@AdrianSosic @AVHopp perhaps we could offer a utility accepting a list of parameters and outputting the required upper memory bound in a human-readable unit according to this approach?

If you have constraints
In that case, the exact number of discrete combinations is hard to estimate, as constraints apply computational logic to delete invalid combinations, so one can't tell in advance. However, a constraint only ever removes search space entries, so if you estimate the search space size without constraints, you have an upper bound.

Currently we greedily build the entire search space in memory before filtering, so even if you have constraints, the unconstrained version needs to fit in memory. This is rather bad, but we are already about 50% of the way toward an implementation that uses polars to build and filter the dataframe lazily before it is collected into memory.
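This is not our actual implementation, just a sketch of the lazy idea with polars (hypothetical parameter columns):

import polars as pl

p1 = pl.LazyFrame({"p1": [1, 2, 3]})
p2 = pl.LazyFrame({"p2": ["A", "B", "C"]})

lazy_product = p1.join(p2, how="cross")           # Cartesian product, not yet materialized
filtered = lazy_product.filter(pl.col("p1") < 3)  # constraint applied lazily
df = filtered.collect()                           # only the valid combinations end up in memory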

NumericalContinuousParameters do not contribute to the search space size
Having said that, they can cause large memory requirements during optimization. This is because, if you have a mixed discrete-continuous search space, one gradient optimization is performed per discrete combination. This can easily lead to tens of thousands of gradient optimizations.
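To get a feeling for how many gradient optimizations a hybrid recommendation would trigger, you can count the discrete combinations (a sketch, assuming the exp_rep dataframe of the discrete subspace):

from baybe.parameters import NumericalContinuousParameter, NumericalDiscreteParameter
from baybe.searchspace import SearchSpace

searchspace = SearchSpace.from_product(
    [
        NumericalDiscreteParameter(name="d1", values=list(range(100))),
        NumericalDiscreteParameter(name="d2", values=list(range(100))),
        NumericalContinuousParameter(name="c1", bounds=(0, 1)),
    ]
)

# 100 * 100 = 10000 discrete combinations, i.e. roughly that many gradient optimizations
print(len(searchspace.discrete.exp_rep))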

monken commented 4 months ago

Thank you, that's very helpful. This should be enough for me to calculate an estimate. Ultimately, I will use this number to launch the right-sized EC2 instance to do the optimization.

It looks like you are using double precision floats by default (https://github.com/emdgroup/baybe/blob/46137953f3377e944737a63a0e2ab10cc015cac1/baybe/utils/numerical.py#L8). Does double precision have a material impact on the quality of the optimization? It basically doubles the memory required.

AdrianSosic commented 4 months ago

Hi @monken, my take: everything Martin said + keep in mind that this only applies to "product search spaces", i.e. those generated via SearchSpace.from_product, which is kind of an upper limit. If you use different constructors, e.g. SubspaceDiscrete.from_simplex, slightly different rules apply, but the overall memory usage will be smaller.

Regarding double precision: it's less about the quality of the optimization and more about numerical stability during the optimization. This is a general recommendation from the botorch side to avoid certain errors, e.g. related to matrix operations. However, as long as the optimization succeeds, I think nothing speaks against using float32 instead. I could imagine making this controllable via an environment variable, for example. It would require a bit of testing though. Any particular thoughts / wishes?

Scienfitz commented 4 months ago

@monken FYI, we added a utility to estimate the required memory, called estimate_product_space_size:

from baybe.parameters import (
    CategoricalParameter,
    NumericalContinuousParameter,
    NumericalDiscreteParameter,
    SubstanceParameter,
)
from baybe.searchspace import SearchSpace, SubspaceDiscrete

parameters = [
    NumericalDiscreteParameter(name="p1", values=[1, 2, 3]),
    CategoricalParameter(name="p2", values=["A", "B", "C"]),
    SubstanceParameter(name="p3", data={"s1": "C", "s2": "CC"}),
    NumericalContinuousParameter(name="p4", bounds=(0, 10)),
]

res1 = SearchSpace.estimate_product_space_size(parameters)
res2 = SubspaceDiscrete.estimate_product_space_size(parameters[:-1])  # must drop continuous parameters

print(res1)
>>> MemorySize(exp_rep_memory=2.2, exp_rep_unit='KB', exp_rep_shape=(18, 3), comp_rep_memory=720, comp_rep_unit='bytes', comp_rep_shape=(18, 5))

exp_rep_memory and comp_rep_memory should be added to get an estimate of how much memory in total is required just for creating the campaign (not for the optimization).

Constraints are not included for now, so these are truly upper bound estimates.

monken commented 4 months ago

Tasty, thank you! Any chance I could force the output to be in bytes, e.g. estimate_product_space_size(parameters, 'b')? Otherwise I'd have to parse the unit string and undo the unit conversion to be able to add the numbers and get the total (I also need to compare it to the OS's free memory, so a plain byte number is easier to deal with).

Scienfitz commented 3 months ago

@monken sure, included with #249 and it will be published with the next version, 0.9.1. The output will now look like:

res = SearchSpace.estimate_product_space_size(parameters)
res
>>> MemorySize(exp_rep_bytes=2250.0, exp_rep_shape=(18, 3), comp_rep_bytes=720, comp_rep_shape=(18, 5))

and human-readable versions can still be obtained via a property:

res.exp_rep_human_readable
>>> (2.20, 'KB')
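With the bytes-based fields, summing and comparing against the available memory becomes straightforward, e.g. (a sketch using psutil, which is not a baybe dependency; parameters is the list from the earlier example):

import psutil
from baybe.searchspace import SearchSpace

size = SearchSpace.estimate_product_space_size(parameters)
total_bytes = size.exp_rep_bytes + size.comp_rep_bytes

if total_bytes > psutil.virtual_memory().available:
    raise MemoryError("Search space would not fit into the available memory.")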