automl / SMAC3

SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization
https://automl.github.io/SMAC3/v2.1.0/

Replace random forest #1116

Open benjamc opened 3 months ago

benjamc commented 3 months ago

Issue: Installing the C++ extension is difficult; replace it with something more Pythonic.

H.S.:

hadarshavit commented 1 month ago

I investigated this a bit more. In the original SMAC (see the extended version: https://www.cs.ubc.ca/labs/algorithms/Projects/SMAC/papers/10-TR-SMAC.pdf, section 4.1, "Transformations of the Cost Metric"), the authors explain the transformation applied when aggregating the leaf samples (which happens at line 222 of the current SMAC implementation: https://github.com/automl/SMAC3/blob/9d194754a5fed3ec48be06987cfc24ee99b76af5/smac/model/random_forest/random_forest.py#L222).
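To make the transformation concrete, here is a minimal sketch (not SMAC's actual internals; all names are illustrative) of the difference between averaging log-costs directly and aggregating on the original cost scale before mapping back to log space, as section 4.1 describes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-tree collections of log-cost samples that fell into the queried leaf
# (shapes and values are illustrative only).
leaf_log_costs = [rng.normal(loc=0.0, scale=0.5, size=5) for _ in range(10)]
all_samples = np.concatenate(leaf_log_costs)

# Naive aggregation: average the log-costs directly.
mean_of_logs = all_samples.mean()

# Transformation-aware aggregation: exponentiate back to the original cost
# scale, average there, then return to log space.
log_of_mean = np.log(np.exp(all_samples).mean())

# By Jensen's inequality (exp is convex), log_of_mean >= mean_of_logs,
# so the two aggregation orders genuinely disagree.
assert log_of_mean >= mean_of_logs
```

The point is that the order of averaging and transforming matters, which is why the leaf aggregation step is where the log transformation has to live.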

Note that the current implementation computes each leaf value for every sample, which can also create huge matrices (the `preds_as_array` matrix).
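One way around that (a hypothetical sketch, not a proposed patch) is to aggregate each leaf once per tree in a single pass over the training data and then answer queries with a lookup, instead of materialising a per-query matrix of leaf samples:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = rng.uniform(size=200)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# One pass over the training data: cache (mean, variance) per leaf id.
# Every leaf of a fitted tree contains at least one training sample,
# so this covers all leaves any query can land in.
leaf_ids = tree.apply(X)
leaf_stats = {}
for leaf in np.unique(leaf_ids):
    samples = y[leaf_ids == leaf]
    leaf_stats[leaf] = (samples.mean(), samples.var())

# Query time: a dictionary lookup per point, no per-query sample matrix.
X_query = rng.uniform(size=(5, 3))
means = np.array([leaf_stats[leaf][0] for leaf in tree.apply(X_query)])
```

Memory then scales with the number of leaves rather than with queries times leaf samples.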

I checked the scikit-learn implementation of random forests. There is an option to set the DecisionTreeRegressor splitter to "random" instead of "best" (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor), which I think is more similar to the SMAC implementation. To support the log transformation, a change to the criterion is required (i.e., the node value must be computed in a different way: https://github.com/scikit-learn/scikit-learn/blob/4aeb191100f409c880d033683972ab9f47963fa4/sklearn/tree/_criterion.pyx#L1032). Such a change should be possible, as different criteria already use different terminal values ("MSE and Poisson deviance both set the predicted value of terminal nodes to the learned mean value of the node whereas the MAE sets the predicted value of terminal nodes to the median", from https://scikit-learn.org/stable/modules/tree.html#regression-criteria).
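For reference, the splitter option mentioned above is a plain keyword argument, so randomised splits are available without touching Cython; only the log-transformed leaf values would need the custom criterion (not implemented in this sketch):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 4))
y = rng.uniform(size=100)

# splitter="random" picks the split threshold at random per candidate
# feature, rather than searching for the best one ("best" is the default).
tree = DecisionTreeRegressor(splitter="random", random_state=0).fit(X, y)

preds = tree.predict(X[:5])
```

Bagging several such trees (or using `ExtraTreesRegressor`, which randomises splits internally) would then give a forest; the remaining gap to SMAC's model is purely the criterion change.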