Heyo, thanks for reaching out about it!
So in theory, Optuna does let you pythonically sample configurations, and I imagine this would be the best way to expose conditionality to a user. I will try to get back to you later this week with a potential solution!
Best, Eddie
Hi Eddie,
Any updates or ideas on this? Happy to help and work it out, just would want to know your take on it before I start diving into it more myself.
Heyo, sorry for not getting back on this!
The main bottleneck in the implementation is really this line:
The problem is that we'd like to return the space to someone, i.e. you can call component.search_space() and get something back that both represents your search space and can be passed to an Optimizer for use (in this case, the only one that matters is OptunaOptimizer).
So what is space when you ask for component.search_space("optuna")? Well, it's essentially a dict[str, optuna.distributions.BaseDistribution], i.e.
component = Component(...)
optuna_space: dict[str, BaseDistribution] = component.search_space("optuna")
What's nice about this optuna_space type definition is that we can pass it directly to self.study.ask(), where self is an instance of an OptunaOptimizer, and it takes care of programmatically sampling from it for optimization.
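As a rough standalone sketch of that flow (nothing AMLTK-specific here, and the parameter names are made up):

import optuna
from optuna.distributions import BaseDistribution, FloatDistribution, IntDistribution

space: dict[str, BaseDistribution] = {
    "learning_rate": FloatDistribution(low=1e-4, high=1e-1, log=True),
    "n_estimators": IntDistribution(low=10, high=1000),
}

study = optuna.create_study(direction="minimize")
trial = study.ask(fixed_distributions=space)  # samples a value for every key in the dict
print(trial.params)  # e.g. {'learning_rate': 0.003, 'n_estimators': 412}
study.tell(trial, 0.5)  # report a (dummy) objective value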
For conditionals, I am aware you can sample conditionally from an optuna.study.Study, however this is usually defined by the user. See this example for how optuna expects this to be done. The problem here lies in the fact that AMLTK is aware of the conditional structure of the search space, while the user may not necessarily be, hence amltk should be the one to do the conditional sampling programmatically. This is fine to do within OptunaOptimizer.ask(), however we would have to extend the type of optuna_space to also include information that ask() can use to do conditional sampling. Doing this extension will probably change the type signature to something like dict[str, BaseDistribution | Some_Type_Used_To_Represent_A_Conditional_AMLTK_Can_Use_To_Determine_How_To_Programmatically_Sample].
Hi Eddie, thanks for the overview! I would really like this feature and am thinking of implementing it. This is the design I currently have in mind; let me know if it's incorrect:
Pipeline:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from amltk.pipeline.components import Choice, Component, Sequential
cl1 = Component(RandomForestClassifier, space={"n_estimators": (10, 1000)})
cl2 = Component(DecisionTreeClassifier, space={"max_depth": (1, 100)})
pipeline = Sequential(
Choice(cl1, cl2, name="model"),
)
pipeline.search_space(parser="configspace")
ConfigSpace:
Configuration space object:
Hyperparameters:
Seq-lwGKnsOu:model:DecisionTreeClassifier:max_depth, Type: UniformInteger, Range: [1, 100], Default: 50
Seq-lwGKnsOu:model:RandomForestClassifier:n_estimators, Type: UniformInteger, Range: [10, 1000], Default: 505
Seq-lwGKnsOu:model:__choice__, Type: Categorical, Choices: {DecisionTreeClassifier, RandomForestClassifier}, Default: DecisionTreeClassifier
Conditions:
Seq-lwGKnsOu:model:DecisionTreeClassifier:max_depth | Seq-lwGKnsOu:model:__choice__ == 'DecisionTreeClassifier'
Seq-lwGKnsOu:model:RandomForestClassifier:n_estimators | Seq-lwGKnsOu:model:__choice__ == 'RandomForestClassifier'
OptunaParser:
pipeline.search_space(parser="optuna")
{'Seq-lwGKnsOu:model:DecisionTreeClassifier:max_depth': IntDistribution(high=100, log=False, low=1, step=1),
'Seq-lwGKnsOu:model:RandomForestClassifier:n_estimators': IntDistribution(high=1000, log=False, low=10, step=1)}
OptunaOptimizer:
# implementation of ask
optuna_trial: optuna.Trial = self.study.ask(self.space)
config = optuna_trial.params
Don't change the type of optuna_space, only add __choice__ keys with a CategoricalDistribution over the Choice subcomponents.
OptunaParser:
pipeline.search_space(parser="optuna")
{'Seq-lwGKnsOu:model:__choice__': CategoricalDistribution([DecisionTreeClassifier, RandomForestClassifier]),
'Seq-lwGKnsOu:model:DecisionTreeClassifier:max_depth': IntDistribution(high=100, log=False, low=1, step=1),
'Seq-lwGKnsOu:model:RandomForestClassifier:n_estimators': IntDistribution(high=1000, log=False, low=10, step=1)}
Change the optimizer behaviour if __choice__ keys are present in the search space.
OptunaOptimizer:
# implementation of ask
if any("__choice__" in k for k in self.space):
    optuna_trial: optuna.Trial = self.study.ask()
    # do all __choice__ suggestions with suggest_categorical
    # filter all parameters given the made choices
    # do all remaining suggestions with the correct suggest function
else:
    optuna_trial: optuna.Trial = self.study.ask(self.space)
config = optuna_trial.params
Will report back if the pseudocode of the ask function is workable.
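To make the first branch concrete, a rough expansion of it could look like the following. This is only a sketch: it assumes the __choice__ values are stored as strings (the subcomponent names) and that the key convention is prefix + chosen name + parameter, as in the search space printed above; none of these helpers are existing AMLTK code.

from optuna.distributions import CategoricalDistribution, FloatDistribution, IntDistribution

def _suggest(trial, name, dist):
    # Map a static distribution onto the matching public suggest_* call
    if isinstance(dist, CategoricalDistribution):
        return trial.suggest_categorical(name, dist.choices)
    if isinstance(dist, IntDistribution):
        return trial.suggest_int(name, dist.low, dist.high, step=dist.step, log=dist.log)
    if isinstance(dist, FloatDistribution):
        return trial.suggest_float(name, dist.low, dist.high, step=dist.step, log=dist.log)
    raise TypeError(f"Unhandled distribution: {dist!r}")

# --- inside OptunaOptimizer.ask() ---
optuna_trial = self.study.ask()

# 1. Resolve every __choice__ key first with suggest_categorical
choices = {
    name: _suggest(optuna_trial, name, dist)
    for name, dist in self.space.items()
    if name.endswith(":__choice__")
}

# 2. A parameter is active only if, for every choice governing its prefix,
#    it sits under the chosen subcomponent (nested Choices not handled yet)
def _is_active(name: str) -> bool:
    for choice_key, chosen in choices.items():
        prefix = choice_key[: -len("__choice__")]  # e.g. "Seq-...:model:"
        if name.startswith(prefix) and not name.startswith(f"{prefix}{chosen}:"):
            return False
    return True

# 3. Suggest the remaining, active parameters with the matching suggest function
for name, dist in self.space.items():
    if not name.endswith(":__choice__") and _is_active(name):
        _suggest(optuna_trial, name, dist)

config = optuna_trial.params

The _suggest helper is needed because study.ask() without fixed_distributions only exposes the public suggest_* methods, so the static distributions have to be translated back into those calls.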
Yup, looks like a good solution! The only difficulty you'll encounter is how to deal with hierarchical choices; the below example is a tad dumb, but it's just for reference:
cl1 = Component(RandomForestClassifier, space={"n_estimators": (10, 1000)})
cl2 = Component(DecisionTreeClassifier, space={"max_depth": (1, 100)})
pipeline = Sequential(
Choice(
Choice(cl1, cl2, name="inner_1"),
cl1,
name="top"
)
)
pipeline.search_space(parser="configspace")
There is no correct solution and it depends on how you want to treat your search space:
If you consider the possible leaves, there are 3 possible end results, with cl1 having a 2/3 chance and cl2 having a 1/3 chance.
If you consider it hierarchically, then you have a 1/2 + (1/2 * 1/2) chance to choose cl1 and a (1/2 * 1/2) chance to choose cl2, i.e. a 3/4 chance to choose cl1 and a 1/4 chance to choose cl2.
The exact numbers only hold under uniform choice; in reality it depends on Optuna's TPE model, which will determine the sampling distribution for each categorical as it learns more over time.
I don't have a strong argument for either, other than that ConfigSpace considers it hierarchically, i.e. 3/4 and 1/4. It really depends on what Optuna suggests doing (which I don't know and haven't looked up).
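For what it's worth, a quick uniform-sampling simulation of the hierarchical interpretation (plain Python, nothing Optuna- or AMLTK-specific):

import random
from collections import Counter

counts = Counter()
for _ in range(100_000):
    top = random.choice(["inner_1", "cl1"])  # top-level Choice: nested Choice vs cl1
    leaf = random.choice(["cl1", "cl2"]) if top == "inner_1" else "cl1"
    counts[leaf] += 1

total = sum(counts.values())
print({k: round(v / total, 3) for k, v in counts.items()})
# roughly {'cl1': 0.75, 'cl2': 0.25}, i.e. the hierarchical 3/4 vs 1/4 split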
Is your feature request related to a problem? Please describe. Right now the Optuna parser does not support conditionals. The TODO says that this functionality is not yet supported, as we can't encode it into a static Optuna search space. Is this support infeasible, or is it possible and on the roadmap?
Describe the solution you'd like A way to still use Choice pipeline components and also Optuna. Ideally the resulting behaviour would be the same as when using Optuna and a Pythonic search space directly, as in the Optuna docs.
Describe alternatives you've considered Alternatively, you can create one component with a function that picks from the choice list depending on a categorical variable, but then the other parameters are no longer passed only to the component where they make sense, so many configs in the space will be invalid.
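For context, the workaround I mean looks roughly like this (the function-as-item usage and space format are just how I understand the AMLTK API, so treat it as a sketch):

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from amltk.pipeline.components import Component, Sequential

def build_model(model: str, n_estimators: int = 100, max_depth: int = 10):
    # One flat space: parameters that don't apply to the chosen model are simply ignored,
    # so many sampled configs are redundant or invalid
    if model == "random_forest":
        return RandomForestClassifier(n_estimators=n_estimators)
    return DecisionTreeClassifier(max_depth=max_depth)

pipeline = Sequential(
    Component(
        build_model,
        space={
            "model": ["random_forest", "decision_tree"],
            "n_estimators": (10, 1000),
            "max_depth": (1, 100),
        },
        name="model",
    ),
)
pipeline.search_space(parser="optuna")  # flat, unconditional space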