Hi, I have been working on a surrogate model (Hyperboost) based on gradient boosting. It seems to outperform SMAC's random forest in most cases, while training and querying the surrogate model take less time.
I think it could be interesting to build this into SMAC.
Concept
The basic idea is to use virtual ensembles to obtain uncertainty estimates. This functionality is built into CatBoost, which also trains quickly. You can build an EPM (empirical performance model) out of this that SMAC can use. A simplified version of the code would look like this:
```python
from catboost import CatBoostRegressor

def _train(self, X, Y):
    # This configuration was found after trying many different settings and running
    # the optimizer on 31 datasets using a small random forest target algorithm.
    self.model = CatBoostRegressor(iterations=100, loss_function="RMSEWithUncertainty",
                                   posterior_sampling=False, verbose=False, random_seed=0,
                                   learning_rate=1.0, random_strength=0, l2_leaf_reg=1)
    self.model.fit(X, Y)

def _predict(self, X):
    # With RMSEWithUncertainty, predict() returns a column for the predicted mean
    # and one for the estimated data uncertainty; we only need the mean here.
    pred = self.model.predict(X)
    # TotalUncertainty yields the columns [mean, knowledge uncertainty, data uncertainty].
    ensemble_preds = self.model.virtual_ensembles_predict(X, prediction_type="TotalUncertainty",
                                                          virtual_ensembles_count=20)
    knowledge = ensemble_preds[:, 1]
    # The knowledge uncertainty returned by virtual ensembles is much lower than that of
    # actual ensembles. Scaling it up gives values of approximately the same magnitude
    # as real ensembles.
    return pred[:, 0], knowledge ** 0.3
```
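For context, here is a minimal sketch of how the mean and uncertainty returned by `_predict` could feed into an expected-improvement acquisition function, which is how SMAC-style optimizers typically consume EPM output. The helper name `expected_improvement`, the usage snippet, and the treatment of the returned uncertainty as a standard deviation are assumptions for illustration, not part of Hyperboost or SMAC:

```python
# A minimal sketch, not part of Hyperboost or SMAC: expected improvement (EI)
# for minimization, computed from a surrogate's mean and uncertainty estimates.
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_so_far):
    # Assumption: `std` is the uncertainty from _predict, treated as a standard deviation.
    std = np.maximum(std, 1e-12)  # guard against division by zero
    z = (best_so_far - mean) / std
    return (best_so_far - mean) * norm.cdf(z) + std * norm.pdf(z)

# Hypothetical usage with the model above:
#   mean, uncertainty = self._predict(candidate_configs)
#   ei = expected_improvement(mean, uncertainty, best_observed_cost)
#   next_config = candidate_configs[np.argmax(ei)]
```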
Benchmarking
I have created a benchmarking tool (Hyperbench) to test the final result on SMAC 1.4. The benchmark includes 54 datasets from OpenML-CC18 and uses target algorithms similar to those of HPOBench.
The configuration for this benchmark can be found here.
The results of the benchmark are (currently) included in the Hyperbench repository. You can clone it, install the dependencies, and run `streamlit run dashboard.py` to view the results interactively (note: keep the budget type on "iterations" for an accurate result). Alternatively, you can have a look at the graphs included below.
The SMAC output is also stored inside the repository here.
Results
[Graphs: time required for the EPM, and normalized average loss per iteration, for each target algorithm: XGBoost, Random Forest, SVM, Stochastic gradient descent.]