automl / amltk

A build-it-yourself AutoML Framework
https://automl.github.io/amltk/
BSD 3-Clause "New" or "Revised" License

Support for conditionals in optimizers like Optuna #289

Closed berombau closed 2 weeks ago

berombau commented 1 month ago

Is your feature request related to a problem? Please describe.

Right now the Optuna parser does not support conditionals. The TODO says that this functionality is not yet supported because it can't be encoded into a static Optuna search space. Is this support infeasible, or is it possible and on the roadmap?

Describe the solution you'd like

A way to keep using Choice pipeline components together with Optuna. Ideally the resulting behaviour would be the same as when using Optuna with a Pythonic search space directly, as in the Optuna docs.

Describe alternatives you've considered

Alternatively, you can create a single component whose function dispatches to one of the choices depending on a categorical variable. However, the remaining parameters are then no longer passed only to the function where they make sense, so many configs in the space will be invalid.
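The workaround described above can be sketched roughly as follows. This is a minimal, library-free illustration: FLAT_SPACE, sample_flat and build_model_config are hypothetical names, not amltk or Optuna API. Every sampled config carries both n_estimators and max_depth, but only one of them is ever used, which is the redundancy the issue is about.

```python
import random

# Hypothetical flat search space: every parameter is always sampled,
# regardless of which model the categorical picks.
FLAT_SPACE = {
    "model": ["RandomForestClassifier", "DecisionTreeClassifier"],
    "n_estimators": (10, 1000),  # only meaningful for the random forest
    "max_depth": (1, 100),       # only meaningful for the decision tree
}


def sample_flat(rng: random.Random) -> dict:
    """Sample every parameter unconditionally."""
    return {
        "model": rng.choice(FLAT_SPACE["model"]),
        "n_estimators": rng.randint(*FLAT_SPACE["n_estimators"]),
        "max_depth": rng.randint(*FLAT_SPACE["max_depth"]),
    }


def build_model_config(config: dict) -> dict:
    """Dispatch on the categorical and keep only the relevant parameter."""
    if config["model"] == "RandomForestClassifier":
        return {"model": config["model"], "n_estimators": config["n_estimators"]}
    return {"model": config["model"], "max_depth": config["max_depth"]}


rng = random.Random(0)
config = sample_flat(rng)
effective = build_model_config(config)
# The parameter that was sampled but never reaches the chosen model.
ignored = set(config) - set(effective)
print(config, effective, ignored)
```

Two configs that differ only in the ignored parameter describe the same model, so the optimizer spends part of its budget on distinctions that have no effect.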

eddiebergman commented 1 month ago

Heyo, thanks for reaching out about it!

So in theory, Optuna does let you pythonically sample configurations, and I imagine this would be the best way to expose conditionality to a user. I will try to get back to you later this week with a potential solution!

Best, Eddie

berombau commented 3 weeks ago

Hi Eddie,

Any updates or ideas on this? Happy to help and work it out, just would want to know your take on it before I start diving into it more myself.

eddiebergman commented 3 weeks ago

Heyo, sorry for not getting back on this!

The main bottleneck in the implementation is really this line:

https://github.com/automl/amltk/blob/b680838a5a4d5f97c6be0d9c481b8f66d9e2fca9/src/amltk/optimization/optimizers/optuna.py#L295

The problem is that we'd like to return the space to someone, i.e. you can call component.search_space() and get something back that both represents your search space and can be passed to an Optimizer for use (in this case, the only one that matters is OptunaOptimizer).

So what is space when you ask for component.search_space("optuna")? Well, it's essentially a dict[str, optuna.distributions.BaseDistribution], i.e.

component = Component(...)
optuna_space: dict[str, BaseDistribution] = component.search_space("optuna")

What's nice about this optuna_space type definition is that we can pass it directly to self.study.ask(), where self is an instance of an OptunaOptimizer and it takes care of programmatically sampling from it for optimization.


For conditionals, I am aware you can sample conditionally from an optuna.study.Study, however this is usually defined by the user. See this example for how Optuna expects this to be done. The problem here lies in the fact that AMLTK is aware of the conditional structure of the search space, while the user may not necessarily be, hence AMLTK should be the one to do the conditional sampling programmatically. This is fine to do within OptunaOptimizer.ask(), however we would have to extend the type of optuna_space to also include information that ask() can use to do conditional sampling. Doing this extension will probably change the type signature to something like dict[str, BaseDistribution | Some_Type_Used_To_Represent_A_Conditional_AMLTK_Can_Use_To_Determine_How_To_Programmatically_Sample].
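One way to picture such an extended type is sketched below. Everything here is hypothetical: Conditional, is_active and the stand-in IntRange (used in place of optuna.distributions.BaseDistribution so the snippet runs without Optuna) are illustrative names, not part of amltk or Optuna.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class IntRange:
    """Stand-in for an Optuna distribution (assumption, for illustration)."""
    low: int
    high: int


@dataclass(frozen=True)
class Conditional:
    """A distribution that is only active when `parent` was sampled as `value`."""
    distribution: IntRange
    parent: str
    value: str


# Hypothetical extended space type: plain entries plus conditional ones.
Space = Dict[str, Union[IntRange, Conditional, List[str]]]


def is_active(entry, sampled: dict) -> bool:
    """Unconditional entries are always active; conditionals check their parent."""
    if isinstance(entry, Conditional):
        return sampled.get(entry.parent) == entry.value
    return True


space: Space = {
    "model:__choice__": ["DecisionTreeClassifier", "RandomForestClassifier"],
    "model:DecisionTreeClassifier:max_depth": Conditional(
        IntRange(1, 100), "model:__choice__", "DecisionTreeClassifier"
    ),
    "model:RandomForestClassifier:n_estimators": Conditional(
        IntRange(10, 1000), "model:__choice__", "RandomForestClassifier"
    ),
}

# Pretend the choice has already been sampled, then see what remains active.
sampled = {"model:__choice__": "RandomForestClassifier"}
active = [key for key, entry in space.items() if is_active(entry, sampled)]
print(active)
```

With a wrapper like this, ask() would have enough information to decide which parameters to suggest after each choice is made, at the cost of the space no longer being directly passable to study.ask().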

berombau commented 3 weeks ago

Hi Eddie, thanks for the overview! I would really like this feature and am thinking of implementing it. Currently thinking of this design, let me know if it's incorrect:

Current behaviour

Pipeline:

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from amltk.pipeline.components import Choice, Component, Sequential

cl1 = Component(RandomForestClassifier, space={"n_estimators": (10, 1000)})
cl2 = Component(DecisionTreeClassifier, space={"max_depth": (1, 100)})

pipeline = Sequential(
    Choice(cl1, cl2, name="model"),
)
pipeline.search_space(parser="configspace")

ConfigSpace:

Configuration space object:
  Hyperparameters:
    Seq-lwGKnsOu:model:DecisionTreeClassifier:max_depth, Type: UniformInteger, Range: [1, 100], Default: 50
    Seq-lwGKnsOu:model:RandomForestClassifier:n_estimators, Type: UniformInteger, Range: [10, 1000], Default: 505
    Seq-lwGKnsOu:model:__choice__, Type: Categorical, Choices: {DecisionTreeClassifier, RandomForestClassifier}, Default: DecisionTreeClassifier
  Conditions:
    Seq-lwGKnsOu:model:DecisionTreeClassifier:max_depth | Seq-lwGKnsOu:model:__choice__ == 'DecisionTreeClassifier'
    Seq-lwGKnsOu:model:RandomForestClassifier:n_estimators | Seq-lwGKnsOu:model:__choice__ == 'RandomForestClassifier'

OptunaParser:

pipeline.search_space(parser="optuna")
{'Seq-lwGKnsOu:model:DecisionTreeClassifier:max_depth': IntDistribution(high=100, log=False, low=1, step=1),
 'Seq-lwGKnsOu:model:RandomForestClassifier:n_estimators': IntDistribution(high=1000, log=False, low=10, step=1)}

OptunaOptimizer:

# implementation of ask
optuna_trial: optuna.Trial = self.study.ask(self.space) 
config = optuna_trial.params

Proposed new behaviour

Don't change type of optuna_space, only add __choice__ keys with CategoricalDistribution over Choice subcomponents.

OptunaParser:

pipeline.search_space(parser="optuna")
{'Seq-lwGKnsOu:model:__choice__': CategoricalDistribution([DecisionTreeClassifier, RandomForestClassifier]),
 'Seq-lwGKnsOu:model:DecisionTreeClassifier:max_depth': IntDistribution(high=100, log=False, low=1, step=1),
 'Seq-lwGKnsOu:model:RandomForestClassifier:n_estimators': IntDistribution(high=1000, log=False, low=10, step=1)}

Change optimizer behaviour if __choice__ keys are present in the search space.

OptunaOptimizer:

# implementation of ask
if any("__choice__" in k for k in self.space):
    # ask without a fixed space so parameters can be suggested one by one
    optuna_trial: optuna.Trial = self.study.ask()
    # do all __choice__ suggestions with suggest_categorical
    # filter all parameters given the made choices
    # do all remaining suggestions with the correct suggest function
else:
    optuna_trial = self.study.ask(self.space)
config = optuna_trial.params

Will report back if the pseudocode of the ask function is workable.
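A rough, library-free sketch of that ask logic is below. It assumes the flat key layout shown above (a "<prefix>:__choice__" key and "<prefix>:<chosen>:<param>" keys), and it uses a hypothetical RandomTrial stand-in that mimics only the suggest_categorical and suggest_int methods of optuna.trial.Trial, so it runs without Optuna installed. IntSpec, RandomTrial and sample_conditionally are illustrative names, not amltk or Optuna API. In real usage the trial would come from study.ask() called without a fixed space.

```python
import random
from typing import Dict, List, NamedTuple, Union

CHOICE = "__choice__"


class IntSpec(NamedTuple):
    """Stand-in for optuna.distributions.IntDistribution (assumption)."""
    low: int
    high: int


class RandomTrial:
    """Hypothetical stand-in mimicking two methods of optuna.trial.Trial."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self.params: Dict[str, Union[int, str]] = {}

    def suggest_categorical(self, name: str, choices: List[str]) -> str:
        self.params[name] = self._rng.choice(choices)
        return self.params[name]

    def suggest_int(self, name: str, low: int, high: int) -> int:
        self.params[name] = self._rng.randint(low, high)
        return self.params[name]


def sample_conditionally(trial, space: dict) -> dict:
    # 1. Make every __choice__ suggestion first, remembering the prefix
    #    (including the trailing ':') that each choice governs.
    chosen: Dict[str, str] = {}
    for key, spec in space.items():
        if key.endswith(":" + CHOICE):
            prefix = key[: -len(CHOICE)]
            chosen[prefix] = trial.suggest_categorical(key, spec)

    # 2. Suggest the remaining parameters, skipping any that sit under a
    #    choice prefix but belong to a branch that was not picked.
    for key, spec in space.items():
        if key.endswith(":" + CHOICE):
            continue
        active = all(
            key.startswith(prefix + name + ":")
            for prefix, name in chosen.items()
            if key.startswith(prefix)
        )
        if active:
            trial.suggest_int(key, spec.low, spec.high)
    return trial.params


space = {
    "Seq:model:__choice__": ["DecisionTreeClassifier", "RandomForestClassifier"],
    "Seq:model:DecisionTreeClassifier:max_depth": IntSpec(1, 100),
    "Seq:model:RandomForestClassifier:n_estimators": IntSpec(10, 1000),
}
config = sample_conditionally(RandomTrial(seed=0), space)
print(config)
```

This sketch only handles integer parameters and a single level of Choice. Nested choices would need the inner __choice__ keys themselves to be filtered by the same prefix rule before they are suggested.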

eddiebergman commented 3 weeks ago

Yup, looks like a good solution! The only difficulty you'll encounter is how to deal with hierarchical choices. The example below is a tad dumb, but just for reference:

cl1 = Component(RandomForestClassifier, space={"n_estimators": (10, 1000)})
cl2 = Component(DecisionTreeClassifier, space={"max_depth": (1, 100)})

pipeline = Sequential(
    Choice(
        Choice(cl1, cl2, name="inner_1"),
        cl1,
        name="top"
    )
)
pipeline.search_space(parser="configspace")

There is no correct solution and it depends on how you want to treat your search space:

If you consider the possible leaves, there are 3 possible end results, with cl1 having a 2/3 chance and cl2 having a 1/3 chance.

If you consider it hierarchically, then you have a 1/2 + (1/2*1/2) chance to choose cl1 and a (1/2*1/2) chance to choose cl2, i.e. 3/4 chance to choose cl1 and a 1/4 chance to choose cl2.

These exact numbers only hold under uniform choice. In reality the distribution depends on Optuna's TPE model, which is likely to adjust the sampling distribution for each categorical as it learns more over time.
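Under uniform choice, the two views can be checked with a few lines of Python. This sketch encodes the nested Choice as nested lists (a hypothetical encoding, not amltk's representation) and computes the probability of ending up at each leaf both ways; leaf_view and hierarchical_view are illustrative names.

```python
from collections import Counter
from fractions import Fraction

# Nested Choice from the example: top = Choice(Choice(cl1, cl2), cl1)
TREE = [["cl1", "cl2"], "cl1"]


def leaves(node) -> list:
    """Flatten the tree into its list of leaves, keeping duplicates."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def leaf_view(tree) -> dict:
    """Treat every leaf as equally likely."""
    all_leaves = leaves(tree)
    counts = Counter(all_leaves)
    return {name: Fraction(n, len(all_leaves)) for name, n in counts.items()}


def hierarchical_view(node, weight: Fraction = Fraction(1)) -> dict:
    """Each Choice picks uniformly among its direct children."""
    if isinstance(node, str):
        return {node: weight}
    probs: dict = {}
    for child in node:
        for name, p in hierarchical_view(child, weight / len(node)).items():
            probs[name] = probs.get(name, Fraction(0)) + p
    return probs


flat = leaf_view(TREE)
nested = hierarchical_view(TREE)
print(flat)
print(nested)
```

The leaf view gives cl1 a probability of 2/3 and cl2 1/3, while the hierarchical view gives 3/4 and 1/4, matching the numbers above.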


I don't have a strong argument for either, other than that ConfigSpace treats it hierarchically, i.e. 3/4 and 1/4. It really depends on what Optuna suggests doing (which I don't know and haven't looked up).