WillKoehrsen / feature-selector

Feature selector is a tool for dimensionality reduction of machine learning datasets
GNU General Public License v3.0
2.23k stars, 768 forks

identify_collinear gets wrong results when features with 100% missing values exist #16

Open bison31205 opened 5 years ago

bison31205 commented 5 years ago

There is a situation: if my data has a feature with 100% missing values, or a very high fraction such as 98% missing values, calling identify_collinear() flags more features than it should as having a correlation magnitude greater than the correlation_threshold.

I checked the result of pd.DataFrame.corr(): there were high correlations between some features and the feature with 98% missing values. So when we call identify_all(), we remove more features than necessary. We should first remove the features whose fraction of missing values exceeds the threshold, and only then identify collinear features. Maybe there are better strategies.
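A minimal sketch of the ordering proposed above, written with plain pandas and NumPy rather than the library's own code. The helper name `drop_sparse_then_find_collinear`, the column names, and both threshold values are assumptions made up for illustration. The toy data shows the likely cause of the effect: `DataFrame.corr()` uses pairwise-complete observations, so a column with only two non-missing values correlates perfectly (magnitude 1) with almost any other column.

```python
import numpy as np
import pandas as pd


def drop_sparse_then_find_collinear(df, missing_threshold=0.9,
                                    correlation_threshold=0.95):
    """Hypothetical helper (not part of feature-selector).

    Drops columns whose missing fraction exceeds `missing_threshold`,
    then flags collinear columns among the remaining ones.
    """
    # Fraction of missing values per column
    missing_fraction = df.isnull().mean()
    sparse = missing_fraction[missing_fraction > missing_threshold].index.tolist()

    # Compute correlations only on the columns that survive the missing filter
    dense = df.drop(columns=sparse)
    corr = dense.corr().abs()

    # Keep the strict upper triangle so each pair is examined once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)

    collinear = [col for col in upper.columns
                 if (upper[col] > correlation_threshold).any()]
    return sparse, collinear


# Toy data: two independent features plus one that is 98% missing
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
mostly_missing = np.full(100, np.nan)
mostly_missing[:2] = [0.0, 1.0]  # only two observed values
df["mostly_missing"] = mostly_missing

# Naive approach: correlation computed from just two overlapping rows
naive_corr = abs(df.corr().loc["a", "mostly_missing"])

# Proposed order: filter on missing values first, then check collinearity
sparse, collinear = drop_sparse_then_find_collinear(df)
print("naive |corr(a, mostly_missing)|:", naive_corr)
print("dropped for missing values:", sparse)
print("flagged as collinear afterwards:", collinear)
```

Another option, if the sparse column has to stay in the frame, is the `min_periods` argument of `DataFrame.corr()`, which returns NaN for any pair with too few overlapping observations instead of a spurious high value.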