angela97lin opened 4 years ago
Good points. A couple thoughts:
Conceptually, I don't think there's a problem with using one model/estimator to do feature selection for another. I'm sure there are ways that could produce poor performance, but we can measure that in the performance tests if we set them up well.
Re: the encoding, we could build a feature selection component which internally does whatever encoding is required to make the selector work, right? AFAIK there's no fundamental blocker to doing something like this.
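To make that concrete, here's a rough sketch using plain scikit-learn pieces; `EncodedRFSelector` is a made-up name for illustration, not a real EvalML component, and the parameters are arbitrary:

```python
# Rough sketch only -- EncodedRFSelector is a hypothetical component name.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import OneHotEncoder


class EncodedRFSelector:
    """Feature selector that one-hot encodes categoricals internally before the RF sees them."""

    def __init__(self, n_estimators=10, random_state=0):
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        X = pd.DataFrame(X)
        cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
        # Encoding happens inside the component, so callers can pass raw categorical data.
        self._encoder = ColumnTransformer(
            [("ohe", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
            remainder="passthrough",
        )
        X_encoded = self._encoder.fit_transform(X)
        self._selector = SelectFromModel(
            RandomForestClassifier(n_estimators=self.n_estimators,
                                   random_state=self.random_state)
        )
        self._selector.fit(X_encoded, y)
        return self

    def transform(self, X):
        # Note: selection happens in the encoded space, so the output columns
        # are encoded features rather than the original raw columns.
        return self._selector.transform(self._encoder.transform(pd.DataFrame(X)))
```

One design question with this approach is whether selected features should be mapped back to the original columns or left in the encoded space; the sketch above does the latter.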
Currently, we only have SelectFromModel. It would be nice to also support feature selectors (e.g., SelectKBest, SelectPercentile) that don't rely on an estimator and instead select features using statistical tests.
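For reference, this is roughly what those selectors look like in scikit-learn, using `f_classif` as the scoring function; the dataset here is just an arbitrary example:

```python
# Estimator-free feature selection via univariate statistical tests.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 features with the highest ANOVA F-statistic.
X_k = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Keep the top 25% of features by the same statistical score.
X_p = SelectPercentile(score_func=f_classif, percentile=25).fit_transform(X, y)

print(X.shape, X_k.shape, X_p.shape)
```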
While working through the catboost PR (#247), I was not able to use a feature selector in the catboost pipeline: catboost can handle categorical data natively, but the random forest classifier inside our RFClassifierSelectFromModel feature selector cannot, which made the selector difficult to include in that pipeline.
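A toy reproduction of the mismatch, assuming scikit-learn's random forest and made-up data:

```python
# catboost accepts raw categoricals, but the RF inside the selector does not,
# so the selector fails on the same input the catboost estimator would accept.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = pd.DataFrame({"color": ["red", "blue", "red", "green"], "size": [1, 2, 3, 4]})
y = np.array([0, 1, 0, 1])

try:
    RandomForestClassifier(n_estimators=10).fit(X, y)
except ValueError as err:
    print(err)  # scikit-learn cannot convert the string column to floats
```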