automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

Range of `max_features` in Random Forest seems highly limited #358

Closed: engelen closed this issue 6 years ago

engelen commented 7 years ago

Looking at the source code of the Random Forest estimator, the search space for the `max_features` hyperparameter of sklearn's `RandomForestClassifier` seems rather limited.

When defining the search space, the value of a property called `max_features` is set to a float in the [0.5, 5] interval. Then, in the `iterative_fit` method, this property is processed, primarily on line 56, using the formula `max_features * (ln(num_features) + 1)`. As the upper limit of `max_features` in the search space is 5, this severely restricts the number of features that can be considered for high-dimensional input data. For `num_features = 1000`, for example, the formula evaluates to at most 39 of the 1000 features.
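
To illustrate (my own sketch, not the actual auto-sklearn source): assuming `num_features = 1000` and the [0.5, 5] search space above, the effective feature counts work out as follows.

```python
import numpy as np

num_features = 1000

# Lower bound, assumed default of 1.0, and upper bound of the search space.
for raw_value in (0.5, 1.0, 5.0):
    effective = int(raw_value * (np.log(num_features) + 1))
    print(f"raw={raw_value} -> {effective} of {num_features} features")
# raw=0.5 -> 3, raw=1.0 -> 7, raw=5.0 -> 39
```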

Is this intentional? Shouldn't the configurator itself decide what's best in the end, without imposing such seemingly arbitrary upper bounds?

mfeurer commented 7 years ago

I partially agree that we want the search to figure out by itself how many features to use. What we're currently using is a flexible version of Breiman's heuristic to choose the number of features. What would be ideal is an interval from 1 feature to all 1000 features, with 7 as the middle of that interval. However, I wouldn't know how to easily integrate this into our current code base. Ideally, we would have a truncated Gaussian, but I wouldn't know how to do 'correct' local perturbations for the local search. Do you have any proposals on how to solve this issue? We can also discuss this in person over the next three days ;)

engelen commented 7 years ago

Recapping what we have discussed in person:

  1. The current implementation is supposed to implement Breiman's heuristic as the default value, but applies the natural logarithm to the number of features instead of the base-2 logarithm (the latter corresponds to Breiman's heuristic).
  2. If we want to use Breiman's heuristic, auto-sklearn would search a fixed linear space, where the middle of the space corresponds to log2(m) features, the minimal value corresponds to 1 feature, and the maximal value corresponds to m features (i.e., all features).
  3. A second option is to use the heuristic from the 2006 paper by Geurts et al., which is sqrt(m). It is also the default value of `max_features` in sklearn and in R (for classification).

The heuristic from the third point would probably be the best choice: we could let SMAC optimize a float value alpha on the interval [0, 1] and transform it to `max_features` via max_features = m^alpha, which yields 1 feature for alpha = 0, sqrt(m) features for alpha = 0.5, and all m features for alpha = 1.
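
A minimal sketch of that transformation (my illustration, not code from the eventual PR), assuming SMAC samples alpha uniformly from [0, 1]:

```python
def effective_max_features(alpha: float, m: int) -> int:
    """Map alpha in [0, 1] to a feature count via m ** alpha."""
    # alpha = 0   -> 1 feature
    # alpha = 0.5 -> ~sqrt(m) features (the Geurts et al. heuristic)
    # alpha = 1   -> all m features
    return max(1, int(round(m ** alpha)))

print([effective_max_features(a, 1000) for a in (0.0, 0.5, 1.0)])
# [1, 32, 1000]
```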

This needs to be implemented for DecisionTree, ExtraTrees, and RandomForest for classification. For regression, Geurts suggests using m/3 features, so we'd have to think about that.

mfeurer commented 7 years ago

Great suggestion, and thanks for researching the source of the sqrt heuristic. I'm all for changing the hyperparameter, but I don't think you should change it for a single decision tree.

mfeurer commented 6 years ago

@engelen did you have any luck implementing this? Let me know if you need help.

engelen commented 6 years ago

I'm busy with some other stuff right now, but I hope to get around to it next weekend (i.e., the 27th and 28th). By the way, I did manage to get an environment up and running locally, so I should be able to test it once I've implemented it.

engelen commented 6 years ago

Well, probably not this weekend either... Things are rather busy at the moment. I'll keep it on my to-do list, though, and see to it that it gets done.

mfeurer commented 6 years ago

I can also take this over if you'd like.

engelen commented 6 years ago

Thanks for the offer, but I should really get this done myself. Since I've finally gotten my test environment up and running locally, it should be very little work (apart from testing it). In fact, I'll make sure to open the initial pull request tonight, so we can see what we need to test it. This is long overdue.

engelen commented 6 years ago

All right, a basic pull request has been made.

The RF and ET regression implementations both use `max_features` ranging from 0.1 to 1.0. We've discussed this before, @mfeurer, but hadn't really reached a conclusion yet. I don't mind if it stays a plain float interval (instead of the exponent-based approach used for classification), since that corresponds to Geurts' suggested default, but it should probably be allowed to go below 0.1.
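
For reference, an illustrative sketch of how such a range might be declared with ConfigSpace (which auto-sklearn uses); the bounds mirror the [0.1, 1.0] interval above, and the default is an assumption for illustration, not necessarily the merged value:

```python
from ConfigSpace.hyperparameters import UniformFloatHyperparameter

# Hypothetical declaration; bounds mirror the [0.1, 1.0] interval discussed
# above, and default_value=1.0 is assumed for illustration only.
max_features = UniformFloatHyperparameter(
    "max_features", lower=0.1, upper=1.0, default_value=1.0
)
```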

engelen commented 6 years ago

Done & closing with https://github.com/automl/auto-sklearn/pull/377.