WillKoehrsen / feature-selector

Feature selector is a tool for dimensionality reduction of machine learning datasets
GNU General Public License v3.0
2.23k stars 768 forks source link

Removing my label column during removal? #25

Open windowshopr opened 5 years ago

windowshopr commented 5 years ago

I'm at a loss as to what's happening here.

I'm downloading historical stock data with Pandas Datareader, and after some small manipulations (ie. re-arranging the dataframe, adding moving averages, etc.), I pass the dataframe to FeatureTools to do a quick Auto Feature Engineering, which it does fine by adding new columns to the dataframe...

BUT then I pass it to FeatureSelector (to remove all columns that are highly correlated, have no importance, etc.) but I receive an issue where the next step in my process cannot find the "label" column in the dataset that I'm trying to point it to anymore (Adj Close). I'm new to FeatureSelector so I'm not entirely sure how to use it yet. From there, it will pass the data on to TPOT to do an Auto Regression.

I have included my full code here, I know you're not supposed to, but it will be a working code for anyone to be able to try and see my issue on their side. The error I get is:

KeyError: "labels ['Adj Close'] not contained in axis"

It would appear that FeatureSelector is removing the "Adj Close" label/column during the removal step, but I thought that was why we assign it to the internal "label=" part? Also, I'm wondering if I could only use the added features from the previous FeatureTools part as I don't want the values from the original historical dataset removed, but I'd still have to label the Adj Close column as my target. So how do I get it to not delete my target/label?

Any suggestions would be great. Would love to get this working. Just type in a ticker symbol to get started (ex. CLVS). Thanks!

main.txt

windowshopr commented 5 years ago

As a workaround, I just re-added the df column with the adj close in it, after the removal process, like so:

# Trying to now name all the feature columns and label for FeatureSelector...
features = df.drop("Adj Close", axis=1)
label = df["Adj Close"]
# Now, drop all columns of low importance
from feature_selector import FeatureSelector
fs = FeatureSelector(data = features, labels = label)
fs.identify_all(selection_params = {'missing_threshold': 0.6,    
                                    'correlation_threshold': 0.98, 
                                    'task': 'regression',    
                                    'eval_metric': 'mse', 
                                    'cumulative_importance': 0.99})
all_to_remove = fs.check_removal()
print(all_to_remove[:])
df = fs.remove(methods = 'all')

# Re-Add the Adj Close to the df because FeatureTools removes it once you assign it as the label for some reason
df['Adj Close'] = label
wbgreen0405 commented 4 years ago

@windowshopr Thank you! I was trying to figure out how to read add my target column.