automl / SMAC3

SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization
https://automl.github.io/SMAC3/v2.1.0/

Support for Hyperboost #915

Open Yatoom opened 1 year ago

Yatoom commented 1 year ago

Hi, I have been working on a surrogate model (Hyperboost) based on gradient boosting. This seems to outperform SMAC's random forest in most cases, while the training and querying of the surrogate model takes less time.

I think it could be interesting to build this into SMAC.

Concept

The basic idea is to use virtual ensembles to obtain uncertainty estimates. This functionality is built into CatBoost, which also trains quickly. You can build an EPM out of this that SMAC can use. A simplified version of the code would look like this:

def _train(self, X, Y):
  # This configuration was found after trying many different settings and running the optimizer on
  # 31 datasets using a small random forest target algorithm.
  self.model = CatBoostRegressor(iterations=100, loss_function="RMSEWithUncertainty", posterior_sampling=False,
                                 verbose=False, random_seed=0, learning_rate=1.0, random_strength=0,
                                 l2_leaf_reg=1)
  self.model.fit(X, Y)

def _predict(self, X):
  pred = self.model.predict(X)
  ensemble_preds = self.model.virtual_ensembles_predict(X, prediction_type="TotalUncertainty",
                                                        virtual_ensembles_count=20)
  knowledge = ensemble_preds[:, 1]

  # The knowledge uncertainty returned by virtual ensembles is much lower than that of actual ensembles.
  # Scaling it up gives numbers of approximately the same magnitude as the real ensembles.
  return pred[:, 0], knowledge ** 0.3
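For intuition: virtual ensembles approximate what you would get from training several independent models, where the knowledge (epistemic) uncertainty is the spread of the per-member mean predictions. Here is a minimal numpy sketch of that post-processing, including the exponent scaling used above; the ensemble predictions are synthetic stand-ins, not actual CatBoost output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the per-member mean predictions of a 20-member
# ensemble over 5 query points (shape: members x points).
member_means = rng.normal(loc=1.0, scale=0.1, size=(20, 5))

# Ensemble prediction: average of the member means.
pred = member_means.mean(axis=0)

# Knowledge uncertainty: variance of the member means across the ensemble.
knowledge = member_means.var(axis=0)

# The snippet above raises the (smaller) virtual-ensemble estimate to a
# power < 1, which enlarges values below 1 toward the real-ensemble scale.
scaled = knowledge ** 0.3

assert pred.shape == (5,) and scaled.shape == (5,)
assert np.all(scaled >= knowledge)  # holds since 0 <= knowledge <= 1 here
```

Note that the exponent scaling only enlarges uncertainties that are below 1; it is a heuristic calibration toward real-ensemble magnitudes, not a principled variance estimate.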

Benchmarking

I have created a benchmarking tool (Hyperbench) to test the final result on SMAC 1.4. The benchmark includes 54 datasets from OpenML-CC18 and uses target algorithms similar to those of HPOBench.

The results of the benchmark are (currently) included in the Hyperbench repository. You could clone it, install the dependencies and run streamlit run dashboard.py to view the results interactively (note: keep the budget type on iterations for an accurate result). Alternatively, you could have a look at the graphs included below.

Results

Time required for EPM


Normalized average loss per iteration

XGBoost

Hyperboost outperforms the others

Random Forest

Hyperboost outperforms the others

SVM

SMAC outperforms Hyperboost

Stochastic gradient descent

Hyperboost outperforms SMAC

alexandertornede commented 1 year ago

Hi!

Thanks for this issue! We are sorry that no one had time to look into this until now, but we will do that within the next few weeks.