automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

Range of `max_features` in Random Forest seems highly limited #358

Closed: engelen closed this issue 6 years ago

engelen commented 7 years ago

Looking at the source code of the Random Forest estimator, the search space for the `max_features` hyperparameter of sklearn's `RandomForestClassifier` seems rather limited.

When defining the search space, the value of a property called `max_features` is set to a float in the [0.5, 5] interval. Then, in the `iterative_fit` method, this property is processed, primarily on line 56, using the formula `max_features * (ln(num_features) + 1)`. As the upper limit of `max_features` in the search space is 5, this severely restricts the number of features that can be considered for high-dimensional input data. For `num_features = 1000`, for example, the formula evaluates to at most 39 of the 1000 features.
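
To illustrate (my own sketch, not the actual auto-sklearn source): assuming `num_features = 1000` and the [0.5, 5] search space above, the effective feature counts work out as follows.

```python
import numpy as np

num_features = 1000

# Lower bound, assumed default of 1.0, and upper bound of the search space.
for raw_value in (0.5, 1.0, 5.0):
    effective = int(raw_value * (np.log(num_features) + 1))
    print(f"raw={raw_value} -> {effective} of {num_features} features")
# raw=0.5 -> 3, raw=1.0 -> 7, raw=5.0 -> 39
```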

Is this intentional? Shouldn't the configurator itself decide what's best in the end, without imposing such seemingly arbitrary upper bounds?

mfeurer commented 7 years ago

I partially agree that we want the search to figure out by itself how many features to use. What we're currently using is a flexible version of Breiman's heuristic to choose the number of features. What would be ideal is an interval from 1 feature to all 1000 features, with 7 as the middle of that interval. However, I wouldn't know how to easily integrate this into our current code base. Ideally, we would have a truncated Gaussian, but I wouldn't know how to do 'correct' local perturbations for the local search. Do you have any proposals on how to solve this issue? We can also discuss this in person over the next three days ;)

engelen commented 7 years ago

Recapping what we have discussed in person:

  1. The current implementation is supposed to implement Breiman's heuristic as the default value, but applies the natural logarithm to the number of features instead of the base-2 logarithm (the latter corresponds to Breiman's heuristic).
  2. If we want to use Breiman's heuristic, auto-sklearn would search a fixed linear space, where the middle of the space corresponds to log2(m) features, the minimal value corresponds to 1 feature, and the maximal value corresponds to m features (i.e., all features).
  3. A second option is to use the heuristic from the 2006 paper by Geurts et al., which is sqrt(m). It is also the default value of `max_features` in sklearn and in R (for classification).

The heuristic from the third point would probably be the best choice: we could let SMAC optimize a float value alpha on the interval [0, 1] and transform it to `max_features` via max_features = m^alpha, which yields 1 feature for alpha = 0, sqrt(m) features for alpha = 0.5, and all m features for alpha = 1.
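
A minimal sketch of that transformation (my illustration, not code from the eventual PR), assuming SMAC samples alpha uniformly from [0, 1]:

```python
def effective_max_features(alpha: float, m: int) -> int:
    """Map alpha in [0, 1] to a feature count via m ** alpha."""
    # alpha = 0   -> 1 feature
    # alpha = 0.5 -> ~sqrt(m) features (the Geurts et al. heuristic)
    # alpha = 1   -> all m features
    return max(1, int(round(m ** alpha)))

print([effective_max_features(a, 1000) for a in (0.0, 0.5, 1.0)])
# [1, 32, 1000]
```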

This needs to be implemented for DecisionTree, ExtraTrees, and RandomForest for classification. For regression, Geurts suggests using m/3 features, so we'd have to think about that.

mfeurer commented 7 years ago

Great suggestion, and thanks for researching the source of the sqrt heuristic. I'm all for changing the hyperparameter, but I don't think you should change it for a single decision tree.

mfeurer commented 6 years ago

@engelen did you have any luck implementing this? Let me know if you need help.

engelen commented 6 years ago

I'm busy with some other stuff right now, but I hope to get around to it next weekend (i.e., the 27th and 28th). By the way, I did manage to get an environment up and running locally, so I should be able to test it once I've implemented it.

engelen commented 6 years ago

Well, probably not this weekend either... Things are rather busy at the moment. I'll keep it on my to-do list, though, and see to it that it gets done.

mfeurer commented 6 years ago

I can also take this over if you'd like.

engelen commented 6 years ago

Thanks for the offer, but I should really get this done myself. Since I've finally gotten my test environment up and running locally, it should be very little work (apart from testing it). In fact, I'll make sure to open the initial pull request tonight, so we can see what we need to test it. This is long overdue.

engelen commented 6 years ago

All right, a basic pull request has been made.

The RF and ET regression implementations both use `max_features` ranging from 0.1 to 1.0. We've discussed this before, @mfeurer, but hadn't really reached a conclusion yet. I don't mind if it stays a plain float interval (instead of the exponent-based approach used for classification), since that corresponds to Geurts' suggested default, but it should probably be allowed to go below 0.1.
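
For reference, an illustrative sketch of how such a range might be declared with ConfigSpace (which auto-sklearn uses); the bounds mirror the [0.1, 1.0] interval above, and the default is an assumption for illustration, not necessarily the merged value:

```python
from ConfigSpace.hyperparameters import UniformFloatHyperparameter

# Hypothetical declaration; bounds mirror the [0.1, 1.0] interval discussed
# above, and default_value=1.0 is assumed for illustration only.
max_features = UniformFloatHyperparameter(
    "max_features", lower=0.1, upper=1.0, default_value=1.0
)
```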

engelen commented 6 years ago

Done & closing with https://github.com/automl/auto-sklearn/pull/377.