ZainNasrullah / pollution-select-feature-selection

conceptualizing method for feature selection
MIT License

implement feature set sub-sampling #25

Open ZainNasrullah opened 4 years ago

ZainNasrullah commented 4 years ago

Rather than using all columns in every iteration, sample a subset from the set of all features. The complement (the features left out of the sample) can then serve as a reasonable heuristic pool from which to select the k features to shuffle/permute.
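A rough sketch of the split described above, assuming features are addressed by column index. The function name `split_features` and the parameters `sample_size` and `k` are hypothetical and not part of the current code base.

```python
import random

def split_features(n_features, sample_size, k, rng):
    """Sample a feature subset and pick k features to permute from its complement.

    Hypothetical helper: names and signature are illustrative only.
    """
    all_idx = list(range(n_features))
    # Features evaluated in this iteration.
    sampled = sorted(rng.sample(all_idx, sample_size))
    sampled_set = set(sampled)
    # Features left out of the sample form the complement.
    complement = [i for i in all_idx if i not in sampled_set]
    # Guard against asking for more features than the complement holds.
    k = min(k, len(complement))
    to_permute = sorted(rng.sample(complement, k))
    return sampled, to_permute

rng = random.Random(0)
sampled, to_permute = split_features(n_features=10, sample_size=6, k=2, rng=rng)
print("sampled:", sampled)
print("to permute:", to_permute)
```

Drawing the permuted features from the complement guarantees they never overlap with the features being evaluated in the same iteration.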

Note that this change would require vectorizing the success and failure counts so that they are tracked per feature rather than once per iteration, but this should be fairly straightforward.
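One way the per-feature counts could look, as a minimal sketch with NumPy. It assumes each iteration yields the sampled feature indices plus a boolean pass/fail flag per sampled feature; `update_counts` and the variable names are hypothetical.

```python
import numpy as np

n_features = 5
successes = np.zeros(n_features, dtype=int)
failures = np.zeros(n_features, dtype=int)

def update_counts(successes, failures, sampled, passed):
    """Increment per-feature counters for one iteration (hypothetical helper).

    sampled: unique indices of the features evaluated in this iteration.
    passed:  boolean flags aligned with `sampled`.
    """
    sampled = np.asarray(sampled)
    passed = np.asarray(passed, dtype=bool)
    successes[sampled[passed]] += 1
    failures[sampled[~passed]] += 1

update_counts(successes, failures, [0, 2, 4], [True, False, True])
update_counts(successes, failures, [0, 1], [False, True])

# Success rate per feature; features that were never sampled stay NaN.
trials = successes + failures
rate = np.divide(successes, trials,
                 out=np.full(n_features, np.nan), where=trials > 0)
print(rate)
```

Features that were never sampled have zero trials, so their rate is left undefined instead of being reported as zero.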

ZainNasrullah commented 4 years ago

Is feature set sampling useful in this algorithm?

In theory, feature sampling should give a better estimate of which features individually lead to good performance, with the limitation that feature interactions may be missed. I'm a little skeptical about its usefulness here because random forest (the base estimator) already has this characteristic built in via the max_features parameter. Wouldn't an additional sampling step in the feature selection algorithm just limit the set of features available to RF (i.e., artificially decrease max_features by excluding a fixed set of features)?
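To put a number on the concern, here is a small sketch assuming the forest uses a square-root rule for max_features (an assumption; the setting actually used is not stated here). The helper `sqrt_candidates` and the values of `p` and `m` are made up for illustration.

```python
import math

def sqrt_candidates(n_features):
    """Candidate features per split under a square-root rule (at least one)."""
    return max(1, int(math.sqrt(n_features)))

p = 100  # total number of features (illustrative value)
m = 50   # features kept after subsampling (illustrative value)

print("candidates per split, all features:", sqrt_candidates(p))
print("candidates per split, subsampled:  ", sqrt_candidates(m))
```

Under that assumption the per-split candidate count drops from 10 to 7. The two mechanisms also differ in scope: max_features draws a fresh candidate set at every split, whereas subsampling hides the excluded features from every tree in that iteration.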