Hi, I have been working on a surrogate model (Hyperboost) based on gradient boosting. It seems to outperform SMAC's random forest in most cases, while training and querying the surrogate model take less time.
I think it could be interesting to build this into SMAC.
Concept
The basic idea is to use virtual ensembles to obtain uncertainty estimates. This functionality is built into CatBoost, which also trains quickly. You can build an EPM (empirical performance model) out of this that SMAC can use. A simplified version of the code would look like this:
```python
from catboost import CatBoostRegressor

def _train(self, X, Y):
    # This configuration was found after trying many different settings and running
    # the optimizer on 31 datasets using a small random forest target algorithm.
    self.model = CatBoostRegressor(iterations=100, loss_function="RMSEWithUncertainty",
                                   posterior_sampling=False, verbose=False, random_seed=0,
                                   learning_rate=1.0, random_strength=0, l2_leaf_reg=1)
    self.model.fit(X, Y)

def _predict(self, X):
    # With RMSEWithUncertainty, predict() returns a column for the predicted mean
    # and one for the estimated data uncertainty; we only need the mean here.
    pred = self.model.predict(X)
    # TotalUncertainty yields the columns [mean, knowledge uncertainty, data uncertainty].
    ensemble_preds = self.model.virtual_ensembles_predict(X, prediction_type="TotalUncertainty",
                                                          virtual_ensembles_count=20)
    knowledge = ensemble_preds[:, 1]
    # The knowledge uncertainty returned by virtual ensembles is much lower than that of
    # actual ensembles. Scaling it up gives values of approximately the same magnitude
    # as real ensembles.
    return pred[:, 0], knowledge ** 0.3
```
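For context, here is a minimal sketch of how the mean and uncertainty returned by `_predict` could feed into an expected-improvement acquisition function, which is how SMAC-style optimizers typically consume EPM output. The helper name `expected_improvement`, the usage snippet, and the treatment of the returned uncertainty as a standard deviation are assumptions for illustration, not part of Hyperboost or SMAC:

```python
# A minimal sketch, not part of Hyperboost or SMAC: expected improvement (EI)
# for minimization, computed from a surrogate's mean and uncertainty estimates.
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_so_far):
    # Assumption: `std` is the uncertainty from _predict, treated as a standard deviation.
    std = np.maximum(std, 1e-12)  # guard against division by zero
    z = (best_so_far - mean) / std
    return (best_so_far - mean) * norm.cdf(z) + std * norm.pdf(z)

# Hypothetical usage with the model above:
#   mean, uncertainty = self._predict(candidate_configs)
#   ei = expected_improvement(mean, uncertainty, best_observed_cost)
#   next_config = candidate_configs[np.argmax(ei)]
```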
Benchmarking
I have created a benchmarking tool (Hyperbench) to test the final result on SMAC 1.4. The benchmark includes 54 datasets from OpenML-CC18 and uses target algorithms similar to those of HPOBench.
The configuration for this benchmark can be found here.
The results of the benchmark are (currently) included in the Hyperbench repository. You can clone it, install the dependencies, and run `streamlit run dashboard.py` to view the results interactively (note: keep the budget type on "iterations" for an accurate result). Alternatively, you can have a look at the graphs included below.
The SMAC output is also stored inside the repository here.
Results
[Graphs: time required for the EPM, and normalized average loss per iteration, for each target algorithm: XGBoost, Random Forest, SVM, Stochastic gradient descent.]