aerdem4 / lofo-importance

Leave One Feature Out Importance
MIT License

Variable Grouping Only Works When Model Parameter is Kept To Default #56

Closed: RyanWendt closed this issue 8 months ago

RyanWendt commented 11 months ago

I was getting a weird error when passing the titanic dataset through lofo-importance, and I think I know why:

Take the code example below: including the parameter model=rf triggers an error, whereas removing it lets the LOFO importance calculation proceed with the default model.

import lofo
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier

titanic = sns.load_dataset('titanic')

# single feature ('sex', a string column) plus the target
df = titanic[['sex', 'survived']]
dataset = lofo.Dataset(df=df, target="survived", features=[col for col in df.columns if col != "survived"])

# passing model=rf triggers the error; omitting model= falls back to the inferred default model
rf = RandomForestClassifier()
lofo_imp = lofo.LOFOImportance(dataset, model=rf, scoring='accuracy')
importance_df = lofo_imp.get_importance()

Root cause: looking at lofo_importance.py, lines 32-33 only call infer_model when model is None. Is there a particular reason for this, or shouldn't it run regardless of the model passed to LOFOImportance?

aerdem4 commented 11 months ago

You are feeding only one feature (sex). Is that intentional?

RyanWendt commented 11 months ago

Yes, just to demonstrate the issue. The issue is still reproducible if you add more features.

aerdem4 commented 11 months ago

Thanks for creating the issue. A single-feature test is not a good test for LOFO, because when you remove that feature you are left with a model that has no features.

Besides, if you are going to use your own model instead of the default one, you need to prepare the feature columns accordingly. It will fail if you feed it columns with string dtypes; you can convert them to categories or label encode them in advance.
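
As a minimal sketch of that preprocessing (using the same titanic frame as in the snippet above; the column choices and use of sklearn's LabelEncoder here are just illustrative, not part of lofo-importance itself):

import lofo
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

titanic = sns.load_dataset('titanic')

# label encode the string column so a plain sklearn model such as
# RandomForestClassifier can consume it
df = titanic[['sex', 'pclass', 'survived']].copy()
df['sex'] = LabelEncoder().fit_transform(df['sex'].astype(str))

dataset = lofo.Dataset(df=df, target="survived", features=[col for col in df.columns if col != "survived"])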

RyanWendt commented 11 months ago

True, one feature is not appropriate for a method like LOFO; my original example should have included at least one more feature.

It sounds like you are suggesting the feature conversion should be done in advance, as you normally would for an sklearn model. But I thought the point of the variable grouping feature was that it would do the label encoding for you?

aerdem4 commented 11 months ago

feature_groups lets users provide custom groups of features that they would like to be added/removed together, and auto_group_threshold creates these groups automatically based on correlations. Unfortunately, neither does any feature preprocessing, in order to avoid making assumptions about the data.
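
For illustration, a rough sketch of both options on the titanic data, assuming feature_groups and auto_group_threshold are accepted as keyword arguments by lofo.Dataset (check the README for the exact signature; the group name and columns are only examples):

import lofo
import seaborn as sns

titanic = sns.load_dataset('titanic')
titanic['sex'] = (titanic['sex'] == 'male').astype(int)  # encoding is still done manually

features = ['sex', 'pclass']

dataset = lofo.Dataset(
    df=titanic,
    target='survived',
    features=features,
    # user-defined group: these columns are added/removed together as one unit
    feature_groups={'fare_and_age': titanic[['fare', 'age']].values},
    # assumed keyword: automatically group features whose correlation exceeds the threshold
    auto_group_threshold=0.7,
)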