I do partially agree that we want the search to figure out by itself how many features to use. What we're currently using is a flexible version of Breiman's heuristic to choose the number of features. What would be ideal is an interval from 1 feature to all 1000 features, with 7 being the middle of that interval. However, I wouldn't know how to easily integrate this given our current code base. Ideally, we would like to have a truncated Gaussian, but I wouldn't know how to do 'correct' local perturbations for the local search. Do you have any proposals on how to solve this issue? We can also discuss this in person over the next three days ;)
Recapping what we have discussed in person:

1. auto-sklearn would search a fixed linear space, where the middle of this space corresponds to log2(m) features, the minimal value corresponds to 1 feature, and the maximal value corresponds to m features (i.e., all features).
2. Use sqrt(m) features. It's also used as the default value of max_features in sklearn and R (for classification).
3. Let SMAC optimize a float value alpha on the interval [0, 1] and transform it to max_features by max_features = m^alpha, which yields:
   - alpha = 0.0 => max_features = 1
   - alpha = 0.5 => max_features = sqrt(m) (= Geurts's heuristic)
   - alpha = 1.0 => max_features = m

The heuristic from the third point would probably be the best choice (a rough sketch follows below). This needs to be implemented for DecisionTree, ExtraTrees, and RandomForest for classification. For regression, Geurts suggests using m/3 features, so we'd have to think about that.
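To make the exponent-based option concrete, here is a minimal, hypothetical sketch (not the actual auto-sklearn code); the helper name resolve_max_features is illustrative, and the exact ConfigSpace keyword arguments may differ between versions:

```python
import numpy as np
from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformFloatHyperparameter

# Tune an exponent alpha in [0, 1] instead of a raw feature count;
# 0.5 (i.e. sqrt(m)) is the default.
cs = ConfigurationSpace()
cs.add_hyperparameter(
    UniformFloatHyperparameter("max_features", 0.0, 1.0, default_value=0.5)
)

def resolve_max_features(alpha, n_features):
    """Map the tuned exponent to an integer feature count via m ** alpha."""
    return max(1, int(np.round(float(n_features) ** float(alpha))))

print(resolve_max_features(0.0, 1000))  # 1 feature
print(resolve_max_features(0.5, 1000))  # ~32 features (sqrt(m))
print(resolve_max_features(1.0, 1000))  # 1000 features (all)
```

The resulting integer would then be passed as max_features to the underlying sklearn estimator, e.g. inside iterative_fit.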
Great suggestion and thanks for researching the source of the sqrt-heuristic. I'm totally for changing the hyperparameter, but don't think you should change it for a single decision tree.
@engelen did you have any luck implementing this? Let me know if you need help.
I'm busy with some other stuff now, but I hope to get around to it next weekend (i.e., the 27-28th). By the way, I did manage to get an environment up and running locally now, so I should be able to test it once I've implemented it.
Well, probably not this weekend either... It's rather busy at the moment. I'll keep it on my todo, though, and see to it that it's done.
I can also take this over if you want to.
Thanks for the offer, but I should really get this done — since I've finally gotten my test environment up and running locally, it should be very little work (except for testing it). In fact, I'll make sure to get the initial pull request in tonight, so we can see what we need to test it. This is long overdue.
All right, a basic pull request has been made.
The RF and ET regression implementations both use max_features ranging from 0.1 to 1.0. We've discussed this before @mfeurer, but hadn't really reached a conclusion yet. I don't really mind if it stays at a regular float interval (instead of the exponent-based approach for classification) since that corresponds with Geurts' suggested default, but it should probably be allowed to go below 0.1.
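Just to make the practical difference concrete (illustrative numbers only, assuming a hypothetical dataset with 10000 features; this is not auto-sklearn code):

```python
m = 10000                  # assumed number of input features

# Fraction-based space currently used for regression, [0.1, 1.0]:
print(int(0.1 * m))        # smallest reachable setting: 1000 features
print(int(m / 3))          # Geurts's m/3 default: 3333 features (a fraction of ~0.33)

# For comparison, sqrt(m) lies well below the current lower bound:
print(int(m ** 0.5))       # 100 features, i.e. a fraction of 0.01
```

This is the kind of case where allowing values below 0.1 could matter.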
Done & closing with https://github.com/automl/auto-sklearn/pull/377.
Looking at the source code of the Random Forest estimator, the search space for the max_features hyperparameter of sklearn's RandomForestClassifier seems rather limited.

When the search space is defined, the property called max_features is set to floats within the [0.5, 5] interval. Then, in the iterative_fit method, this property is processed, primarily on line 56. The formula is max_features * (ln(num_features) + 1). As max_features's upper limit is 5 in the search space, this really limits the possible number of features to consider for high-dimensional input data. For num_features=1000, for example, the equation evaluates to a maximum of 39 of the 1000 features.

Is this intentional? Shouldn't the configurator itself decide what's best in the end, without imposing such seemingly arbitrary upper bounds?