emdgroup / baybe

Bayesian Optimization and Design of Experiments
https://emdgroup.github.io/baybe/
Apache License 2.0

Cannot assign the following values containing duplicates to parameter X #58

Closed mmortazavi closed 9 months ago

mmortazavi commented 10 months ago

Thanks for such a cool abstraction for designing experiments using powerful Bayesian methods.

I have a set of experimental data in which each record contains various parameters. I am not sure whether this is by design, but in experiments it is natural for a parameter to hold the same value (duplicates) across multiple experiments. However, when defining the parameters as CategoricalParameter, NumericalDiscreteParameter, or NumericalContinuousParameter, I get a traceback error, here shown for one of the parameters: ValueError: Cannot assign the following values containing duplicates to parameter FeatureName: (1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6).

AdrianSosic commented 10 months ago

Hi @mmortazavi, thank you very much! ❤️‍🔥 Great to see that people are already interacting with the framework 🏅Getting user input at this stage will be really helpful for us to shape our APIs and also to adapt to the various use cases that are out there.

To your question: the fact that a parameter may not contain duplicate values is by design. It's probably up to us to explain better why this is the case (we are currently working on the docs!). But to give you the idea: the purpose of the parameter objects is to define which (physical) configurations are possible in the first place. While you can repeat an experiment for a given value, say 1, the repetition still refers to the same underlying setting in the parameter space. Having duplicate experiments, on the other hand, is perfectly valid. They simply enter as separate entries in your measurements dataframe that refer to the same underlying parameter setting.

For example, this is a perfectly valid setting:

import pandas as pd

from baybe import Campaign
from baybe.objective import Objective
from baybe.parameters import NumericalDiscreteParameter
from baybe.searchspace import SearchSpace
from baybe.targets import NumericalTarget

parameters = NumericalDiscreteParameter(name="param", values=[1, 2, 3])
targets = NumericalTarget(name="target", mode="MAX")
objective = Objective(targets=[targets], mode="SINGLE")
searchspace = SearchSpace.from_product([parameters])
campaign = Campaign(searchspace=searchspace, objective=objective)

# You can provide as many duplicates as you wish. Each record will be treated as a
# single data point that enters the model. Duplicate parameter settings simply provide
# more evidence about the target values for those configurations, helping to reduce
# model uncertainty.
data_containing_duplicates = pd.DataFrame.from_records(
    [
        {"param": 1, "target": 0},
        {"param": 1, "target": 0},  # exact duplicate
        {"param": 1, "target": 1},  # duplicate parameter setting, different measurement
        {"param": 2, "target": 5},
    ]
)
campaign.add_measurements(data_containing_duplicates)

But perhaps I misunderstood the question?

mmortazavi commented 10 months ago

@AdrianSosic My pleasure. I will do my best to keep using the tool and provide feedback. I can see that there is room for improvement: documentation, real-data examples from various industry domains, features, etc.

Many thanks for the detailed answer. I now understand the parameter objects: defining the possible (physical) configurations is totally reasonable, meaning a unique set of configurations needs to be defined in the parameter space. I have dealt with that in my real physical data. However, now I am facing another challenge. I am not sure this thread is the right place to discuss the follow-up problem, though!

I have 25 parameters and, in total, 27 experiments (i.e., rows). When I try to create the search space (Cartesian product of all possibilities):

searchspace = SearchSpace.from_product(parameters)
campaign = Campaign(searchspace=searchspace, objective=objective)

I get the following traceback:

     54     return _wrapit(obj, method, *args, **kwds)
     56 try:
---> 57     return bound(*args, **kwds)
     58 except TypeError:
     59     # A TypeError occurs if the object does have such a method in its
     60     # class, but its signature is not identical to that of NumPy's. This
   (...)
     64     # Call _wrapit from within the except clause to ensure a potential
     65     # exception has a traceback chain.
     66     return _wrapit(obj, method, *args, **kwds)

MemoryError: Unable to allocate 10.8 PiB for an array with shape (12150000000000000,) and data type int8

Somehow the search space has become enormous, although my data really is not that large! I have tried other ways to create the SearchSpace, including other methods from the docs, but so far my attempts have failed.

If you would rather I post this as a separate issue, let me know and I can do that and elaborate a bit more.

full code:

import pandas as pd
from baybe import Campaign
from baybe.objective import Objective
from baybe.parameters import NumericalDiscreteParameter, CategoricalParameter
from baybe.searchspace import SearchSpace, SubspaceDiscrete, SubspaceContinuous
from baybe.targets import NumericalTarget

df.head(2)

DROP        FF  FWHM 1  FWHM 2  FWHM 3  FWHM 4  FWHM 5   FWHM 6  2-theta 1  2-theta 2  2-theta 3  2-theta 4  2-theta 5  2-theta 6  Intensity 1  Intensity 2  Intensity 3  Intensity 4  Intensity 5  Intensity 6  relative_intensity 1  relative_intensity 2  relative_intensity 3  relative_intensity 4  relative_intensity 5  relative_intensity 6
    1 38.487449    0.45    0.35    0.51    0.57    0.54 0.925902      32.25       34.9      36.86      48.29      57.39      63.89          0.5          0.4         0.78         0.15          0.3         0.18                  64.1                 51.28                 100.0                 19.23                 38.46                 23.07
    2 41.509692    0.45    0.35    0.51    0.57    0.54 0.925902      32.25       34.9      36.86      48.29      57.39      63.89          0.5          0.4         0.78         0.15          0.3         0.18                  64.1                 51.28                 100.0                 19.23                 38.46                 23.07

numerical_df = df.drop(['FF', "DROP"], axis=1)
categorical_df = df[["DROP"]]
target_df = df[["FF"]]

target = NumericalTarget(
    name="FF",
    mode="MAX",
)
objective = Objective(mode="SINGLE", targets=[target])

parameters = []

for numerical_col in numerical_df.columns:
    parameters.append(NumericalDiscreteParameter(
                                                name=numerical_col,
                                                values=tuple(set(df[numerical_col])),
                                                ))
for categorical_col in categorical_df.columns:
    parameters.append(CategoricalParameter(
                                        name=categorical_col,
                                        values=tuple(set(df[categorical_col])),
                                        encoding="INT",
                                        ))

searchspace = SearchSpace.from_product(parameters)
campaign = Campaign(searchspace=searchspace, objective=objective)

AdrianSosic commented 10 months ago

Hi @mmortazavi, glad to hear that this makes sense to you!

Regarding your new error: strictly speaking, this is not directly related to BayBE but to the way you attempt to represent and create your search space. You see, if you try to build the full Cartesian product of 25 parameters, the array/DataFrame holding the resulting set of configurations grows exponentially with the number of parameters, and you will run out of memory very quickly. Unfortunately, I can't see the exact size of each parameter from the code you shared (i.e., how many possible values each parameter can take), but even for the smallest sizes you quickly get astronomically large numbers. For instance, if each parameter can take only 2 possible values, the resulting Cartesian product already consists of 2^25 = 33,554,432 elements (still manageable); for three values each, it's already 3^25 = 847,288,609,443 elements ... you see where this goes.
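
To make these numbers concrete, here is a minimal sketch of the combinatorics; the per-parameter cardinalities are made up for illustration:

import math

# Hypothetical example: 25 parameters with only 3 distinct values each.
n_values_per_parameter = [3] * 25

# Number of rows an exhaustive Cartesian-product search space would contain.
n_configurations = math.prod(n_values_per_parameter)
print(f"{n_configurations:,}")  # 847,288,609,443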

So how to solve this? The answer depends a bit on what you are trying to achieve: is it really the full product space that you want to optimize over?

Let me know if this helps 👍🏼

Scienfitz commented 10 months ago

Hi @mmortazavi, summarizing in a slightly more concise way:

mmortazavi commented 10 months ago

Thanks @Scienfitz and @AdrianSosic for your hints and recommendations. As explained above, the problem of creating parameters from a set of non-unique values is understood and dealt with. The memory problem resulting from SearchSpace.from_product(parameters) is also understandable. @AdrianSosic, I believe your line of thinking fits my problem best.

Essentially "Is it really the full product space that you want to optimize over?" is "no".

Since we are going back and forth about the most suitable method available in BayBE, let me describe the problem at hand better. As @Scienfitz guessed correctly, I have a limited number of experimental measurements. The sample data I posted contains features generated from XRD of experimental samples, plus a substance (DROP), and the target is the fill factor (FF), i.e., a measure of the "squareness" of the solar cell, which corresponds to the area of the largest rectangle that fits inside the IV curve.

The ultimate goal is: given the set of conducted experimental measurements, which next experiment would maximize FF? I am trying to use Bayesian methods to explore the search space and, via a physics-driven exploration-exploitation trade-off, obtain the next recommended experiment (or a set of them). If the XRD-generated features are used, this would provide guidance on which growth mechanism to use in the lab for the next batch!

I am now wondering whether the XRD-driven features should be treated as a discrete search space (I am leaning towards this, since we have more control over the next measurement) or rather as a continuous one!
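
For illustration, a minimal sketch of the two options (the values and bounds are made up; the continuous variant assumes BayBE's NumericalContinuousParameter with a bounds argument):

from baybe.parameters import NumericalContinuousParameter, NumericalDiscreteParameter

# Discrete: only the listed values can ever be recommended.
fwhm_discrete = NumericalDiscreteParameter(name="FWHM 1", values=(0.35, 0.45, 0.51))

# Continuous: any value inside the bounds can be recommended.
fwhm_continuous = NumericalContinuousParameter(name="FWHM 1", bounds=(0.3, 0.6))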

Currently, based on the earlier suggestions, I am using the code below:

from baybe import Campaign
from baybe.objective import Objective
from baybe.parameters import NumericalDiscreteParameter, CategoricalParameter, NumericalContinuousParameter
from baybe.searchspace import SearchSpace, SubspaceDiscrete, SubspaceContinuous
from baybe.targets import NumericalTarget

numerical_df = df.drop(['FF', "DROP"], axis=1)
categorical_df = df[["DROP"]]
target_df = df[["FF"]]

target = NumericalTarget(
    name="FF",
    mode="MAX",
)
objective = Objective(mode="SINGLE", targets=[target])

parameters = []

for numerical_col in numerical_df.columns:
    parameters.append(NumericalDiscreteParameter(
                                                name=numerical_col,
                                                values=tuple(set(df[numerical_col])),
                                                ))
for categorical_col in categorical_df.columns:
    parameters.append(CategoricalParameter(
                                        name=categorical_col,
                                        values=tuple(set(df[categorical_col])),
                                        encoding="INT",
                                        ))

searchspace = SearchSpace(discrete=SubspaceDiscrete.from_dataframe(df, parameters))
campaign = Campaign(searchspace=searchspace, objective=objective)

Whether or not this approach is valid given my explanation of the problem at hand, I am still puzzled about how to optimize and search for the next points from the created campaign object. Simply calling campaign.recommend(batch_quantity=3) doesn't seem correct, does it? At least from this call I get 3 rows of data that are exactly the same as rows I already have in my data.

Also, in SearchSpace(discrete=SubspaceDiscrete.from_dataframe(df, parameters)), the df I pass is the whole dataframe containing my target. I saw in the docs that the target can (or should) be added later, as in one of the demos:

df["Target"] = [values...]
campaign.add_measurements(df)

Perhaps you can point me to a full example here, if one already exists in the documentation. I have previously used the Python bayes_opt package to achieve the same thing: there I used an ML model as a surrogate to estimate the unknown objective function for the underlying data and Bayesian optimization to identify the next point of interest. However, that package gives me less flexibility in defining my search space (discrete spaces, for instance, are not straightforward). I was thinking of going with BoTorch, and then I stumbled upon BayBE!

I also wonder what happens in the background in my current tests if I do not provide a surrogate model. Is one of the available standard ones, e.g. GaussianProcessSurrogate, used by default?

As you can see, I am not an expert in Bayesian methods; I am trying to familiarize myself with their limitations and to formulate the problem at hand as correctly as possible to analyse my data.

Scienfitz commented 10 months ago

@mmortazavi

1) The utility SubspaceDiscrete.from_dataframe(configurations) will create a search space that only contains the combinations present in configurations. If your configurations equal your existing measurements, you will only ever be recommended the same configurations you already measured. Also, I see you did not remove the target, hence it is added as a parameter there too.

2) It seems you never inform the campaign about your measurements, hence you cannot expect any smart recommendations. Creating the SearchSpace only tells the campaign which parameters there are and which values they can take, not which measurements you have already performed.

This is done with campaign.add_measurements. Did you follow this basic example closely? https://emdgroup.github.io/baybe/examples/Basics/campaign.html It contains all steps; even if you create the search space in a different manner, the subsequent workflow is the same.
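
Condensed, a sketch of the workflow being described, reusing df, parameters, and objective from the code above (the target column FF is dropped when constructing the search space, as per point 1):

# Build the search space only from parameter columns, not the target.
searchspace = SearchSpace(
    discrete=SubspaceDiscrete.from_dataframe(df.drop(columns=["FF"]), parameters)
)
campaign = Campaign(searchspace=searchspace, objective=objective)

# Inform the campaign about the experiments already performed (parameters + target).
campaign.add_measurements(df)

# Only now can the recommender exploit the data; without added measurements,
# the default strategy falls back to random recommendations.
recommendations = campaign.recommend(batch_quantity=3)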

Your question about the strategy: the strategy is optional, and a default one is selected if not provided. The default one uses a Gaussian process surrogate with a sequential-greedy optimizer. You can create your own strategy, for instance, like this:

strategy = TwoPhaseStrategy(
    initial_recommender=INITIAL_RECOMMENDER,
    recommender=SequentialGreedyRecommender(
        surrogate_model=SURROGATE_MODEL, acquisition_function_cls=ACQ_FUNCTION
    ),
    allow_repeated_recommendations=ALLOW_REPEATED_RECOMMENDATIONS,
    allow_recommending_already_measured=ALLOW_RECOMMENDING_ALREADY_MEASURED,
)

for more info consult https://emdgroup.github.io/baybe/examples/Basics/strategies.html#creating-the-strategy-object.

By default, the TwoPhaseStrategy will use a Bayesian algorithm to give you your recommendations, but ONLY if data is available. If no data has been added via add_measurements yet and you haven't selected any other algorithm, you will get random recommendations, which might be exactly what you observed.
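
For completeness, a sketch of how such a strategy would be attached to the campaign (this assumes the Campaign constructor accepts a strategy keyword, as in the linked strategies example; searchspace, objective, df, and strategy refer to the snippets above):

# Wire the custom strategy into the campaign (strategy keyword assumed).
campaign = Campaign(
    searchspace=searchspace,
    objective=objective,
    strategy=strategy,
)

# With measurements added, recommendations come from the Bayesian algorithm;
# without them, the initial (random) recommender is used instead.
campaign.add_measurements(df)
recommendations = campaign.recommend(batch_quantity=3)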

AdrianSosic commented 9 months ago

Hi @mmortazavi. Since there was no further response from your end, I assume that all questions have been clarified, so I will close this issue now. However, please feel free to reopen it at any time in case you would like to continue the discussion. Best, Adrian 🤙🏼