experimental-design / bofire

Experimental design and (multi-objective) Bayesian optimization.
https://experimental-design.github.io/bofire/
BSD 3-Clause "New" or "Revised" License

Questions on Usage (Output of Acquisition Function Values, Design Methods for Search Space) #333

Open tatsuya-takakuwa opened 7 months ago

tatsuya-takakuwa commented 7 months ago

Thank you as always.

Is there a way to obtain the acquisition function values for the candidates, not just the mean and variance of the predictions returned by the .ask function? I want to use them as a clue for prioritizing candidates.

Also, if the search space is, for instance, the set of reagent molecules that can be purchased, and the features are descriptors generated from the reagents' molecular structures, then the combinations of feature values are fixed. In such a case, how should I design the search space with bofire?

jduerholt commented 7 months ago

You could take the candidates after their generation and feed them into the strategy.calc_acquisition method; it will return the actual acqf values. https://github.com/experimental-design/bofire/blob/591401c0db8963b116b3c8193e11c2e49b88738a/bofire/strategies/predictives/botorch.py#L160
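
Roughly like this (just a sketch, not tested; see the linked method for the exact signature):

# Rough sketch (untested): generate candidates, then evaluate the acquisition
# function on them via the method linked above.
candidates = strategy.ask(candidate_count=5)
acqf_values = strategy.calc_acquisition(candidates)  # one value per candidate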

Regarding your second question: I am not sure that I completely understand it. If you have hand-designed descriptors, you can set up the search space using CategoricalDescriptorInput. If you want to use Mordred descriptors, Morgan fingerprints, or fragment descriptors and generate them on the fly, you can use the CategoricalMolecularInput feature and define within the SurrogateSpecs which featurizer you actually want to use. Note that we can currently only handle fully combinatorial search spaces in the case of CategoricalMolecularInput; mixed search spaces using it alongside continuous inputs are still work in progress. If you can provide a bit more detail about your problem, I can also set up a minimal working example for you.
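
As a rough illustration (just a sketch; the feature names and descriptor values below are made up):

from bofire.data_models.features.api import (
    CategoricalDescriptorInput,
    CategoricalMolecularInput,
)

# Rough sketch (made-up names and values): a categorical input whose
# categories carry hand-designed descriptors ...
solvent = CategoricalDescriptorInput(
    key="solvent",
    categories=["water", "ethanol", "toluene"],
    descriptors=["polarity", "molar_volume"],
    values=[[10.2, 18.0], [5.2, 58.5], [2.4, 106.0]],
)

# ... and a molecular input over purchasable reagents given as SMILES; the
# featurizer (e.g. Mordred descriptors) is then selected in the surrogate specs.
reagent = CategoricalMolecularInput(
    key="reagent",
    categories=["CCO", "c1ccccc1", "CC(=O)O"],
)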

Maybe also this tutorial notebook could be helpful for you: https://github.com/experimental-design/bofire/blob/main/tutorials/benchmarks/009-Bayesian_optimization_over_molecules.ipynb

cc: @simonsung06

tatsuya-takakuwa commented 7 months ago

@jduerholt Thank you for your reply. I was able to get the evaluation values of the acquisition function. Thank you very much.

Also, thank you for the information about CategoricalDescriptorInput and CategoricalMolecularInput.

I apologize for the additional questions, but I have two points of inquiry regarding the use of the above:

  1. Variable importance after learning in the surrogate model

    When checking variable importance while building a preliminary model, only the keys of features such as the CategoricalDescriptorInput were listed, and the importance of the individual descriptors within them was not visible. Is there a way to check the importance of each descriptor?

  2. Interactions when two CategoricalDescriptorInputs exist

    When using the properties of two molecules as explanatory variables, there are cases where features are created by combining them, such as ratios (for example, I want to combine the properties of solvent and solute molecules with their concentrations). In such cases, can I preprocess the CategoricalDescriptorInputs to create features that represent the combined properties?

jduerholt commented 7 months ago

Nothing to apologize for!

Regarding your questions:

  1. Did you use the permutation feature importance? Currently it runs only over the original features and not the transformed ones. In principle, it could be extended in this direction, but this will take a while, at least if we do it, as it currently does not have the highest priority. But feel free to give it a try! If you are using a SingleTaskGPSurrogate, you could have a look at the lengthscales; here it is shown how to extract them from the kernel: https://github.com/experimental-design/bofire/blob/591401c0db8963b116b3c8193e11c2e49b88738a/bofire/surrogates/feature_importance.py#L12 The current implementation of the method will crash when using it with CategoricalDescriptors, but it should be very easy for you to either extend the method or just apply the extraction to the fitted GP and assign the lengthscales to the individual features (see the sketch below this list). In case of questions, I am happy to assist or provide you with an MWE.

  2. This is currently not yet implemented, and only prepared via the ContinuousDescriptorInput, which is itself not yet fully supported in the GPs. For me the open question there is still which mixing rules to apply, i.e. how to weight the explanatory features by the concentrations: arithmetic mean, geometric mean, ...? It would be really cool to integrate this into BoFire, and we could brainstorm together how to do it in the best way.
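
For point 1, the extraction is roughly like this (just a sketch, not tested; surrogate is the fitted SingleTaskGPSurrogate):

# Rough sketch (untested): extract ARD lengthscales from a fitted
# SingleTaskGPSurrogate; a shorter lengthscale roughly means a more
# influential feature. Assumes the default ScaleKernel(Matern) setup;
# attribute paths may differ for other kernels, and with descriptor or
# one-hot transforms the columns are the transformed features, not the
# original input keys.
gp = surrogate.model  # underlying BoTorch/GPyTorch model
lengthscales = gp.covar_module.base_kernel.lengthscale.detach().numpy().ravel()
print(lengthscales)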

Best,

Johannes

tatsuya-takakuwa commented 7 months ago

@jduerholt

Thank you very much once again. Regarding categorical descriptors, I managed to resolve it by creating a class that decomposes them when using Cross-validation. Thank you for your support!

As for the mixing rules, there is still discussion in the Bayesian optimization community, but for now, creating composite descriptors and applying recursive feature selection might be better. The reference below, which covers composite descriptors and recursive feature selection, has been very helpful: https://www.sciencedirect.com/science/article/pii/S0264127520307838?via%3Dihub

I'm thinking of developing a method to generate simple composite variable combinations from categorical descriptors.
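
Roughly, I have something like this in mind (just a sketch with made-up values, done in pandas before defining the search space):

import pandas as pd

# Rough sketch (made-up values): build composite features such as a
# solvent/solute polarity ratio from the per-molecule descriptor tables.
solvent = pd.DataFrame({"solvent": ["water", "ethanol"], "polarity": [10.2, 5.2]})
solute = pd.DataFrame({"solute": ["A", "B"], "polarity": [3.1, 7.4]})

mix = solvent.merge(solute, how="cross", suffixes=("_solvent", "_solute"))
mix["polarity_ratio"] = mix["polarity_solvent"] / mix["polarity_solute"]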

I have an additional question: is it possible to further classify the descriptors registered under categorical descriptors into continuous, discrete, and categorical?

Also, I understand that when experimental data is added via 'tell' in a strategy, the surrogate is retrained.

Is it possible to set up cross-validation or LOO (leave-one-out)? This is very important for ensuring extrapolation on small data, so I would like to add this setting.

Thank you very much for your assistance.

jduerholt commented 7 months ago

Hi @tatsuya-takakuwa,

Regarding the class that you wrote for cross-validation: can you share it? I would be interested.

Regarding the descriptors: currently we only support ordinal ones there (meaning continuous and discrete), and there is no further classification. But of course you can set up categorical molecular features and use, for example, Mordred descriptors on the fly ...

Regarding tell: if you call tell, the surrogate models will be retrained on the whole dataset, but you can also instantiate the surrogate outside of the strategy and perform cross-validation via surrogate.cross_validate:

https://github.com/experimental-design/bofire/blob/33a2053e80c371791cb27364c442b83f866b1c09/bofire/surrogates/trainable.py#L57
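
Roughly like this (just a sketch, not tested; domain and experiments are placeholders, and the exact constructor arguments are in the linked code):

import bofire.surrogates.api as surrogates
from bofire.data_models.surrogates.api import SingleTaskGPSurrogate

# Rough sketch (untested): build the surrogate data model, map it to the
# trainable surrogate, and run k-fold cross-validation on your experiments.
data_model = SingleTaskGPSurrogate(inputs=domain.inputs, outputs=domain.outputs)
surrogate = surrogates.map(data_model)
train_cv, test_cv, pi = surrogate.cross_validate(experiments, folds=5)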

Within the BO loop, you can set frequency_hyperopt; then CV is used within tell to select the best set of hyperparameters, and the model is subsequently trained on the whole dataset with those hyperparameters.

https://github.com/experimental-design/bofire/blob/33a2053e80c371791cb27364c442b83f866b1c09/bofire/strategies/predictives/botorch.py#L78

This works for every surrogate that implements a so-called Hyperconfig, such as the SingleTaskGPSurrogate:

https://github.com/experimental-design/bofire/blob/33a2053e80c371791cb27364c442b83f866b1c09/bofire/data_models/surrogates/single_task_gp.py#L108

Was this helpful?

Best,

Johannes

tatsuya-takakuwa commented 7 months ago

@jduerholt Thank you for sharing. The categorical descriptors are decomposed using the following function.

from bofire.data_models.features.api import ContinuousInput


def decomposition_input_features(categorical_descriptors):
    # Decompose a CategoricalDescriptorInput into one ContinuousInput per
    # descriptor, bounded by that descriptor's min/max over all categories.
    df = categorical_descriptors.to_df()
    descriptor_names = df.iloc[:, 1:].columns.tolist()

    descriptors = []
    for descriptor_name in descriptor_names:
        descriptors.append(
            ContinuousInput(
                key=descriptor_name,
                bounds=(df[descriptor_name].min(), df[descriptor_name].max()),
            )
        )

    return descriptors

Subsequently, the input features are updated, and the experiment data is updated as well. I conducted cross-validation using a RandomForest and obtained results like the attached plot; the performance metrics also remained unchanged.

from bofire.data_models.domain.api import Inputs

input_features = Inputs(features=decomposition_input_features(Molecule))

train_cv, test_cv, pi = model.cross_validate(
    experiments,
    folds=5,
    hooks={"permutation_importance": permutation_importance_hook},
    # hooks={"permutation_importance": permutation_importance_hook, "lengthscale_importance": lengthscale_importance_hook},
)

[attached plot: newplot(12)]

Thank you also for your advice regarding cross-validation. Adopting your second suggestion made the process simpler:

sobo_strategy_data_model = SoboStrategy(domain=domain, surrogate_specs=surrogate_specs, acquisition_function=qNEI(), folds=-1)

jduerholt commented 7 months ago

Regarding your last line: you also have to set frequency_hyperopt to something larger than zero, only then will it be used ;)
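
I.e. roughly (just a sketch; see the data model linked above for the exact fields):

# Rough sketch: with frequency_hyperopt > 0, tell() runs the CV-based
# hyperparameter search every n-th call before refitting on all data.
sobo_strategy_data_model = SoboStrategy(
    domain=domain,
    surrogate_specs=surrogate_specs,
    acquisition_function=qNEI(),
    folds=-1,
    frequency_hyperopt=1,
)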

The helper function regarding the cross val is smart! Good idea.