feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

fit_params not supported for SelectByShuffling (documentation suggests it is supported) #654

Closed by markdregan 1 year ago

markdregan commented 1 year ago

Describe the bug

Documentation suggests that fit_params is supported for SelectByShuffling, but my tests suggest it is not. I have a working example below. Let me know if I am using it incorrectly or if I can run any more tests to help.

Working example without fit_params

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from feature_engine.selection import SelectByShuffling

X = pd.DataFrame(dict(x1 = [1000,2000,1000,1000,2000,3000], x2 = [1000,2000,1000,1000,2000,3000]))
y = pd.Series([1,0,0,1,1,0])

sbs = SelectByShuffling(RandomForestClassifier(random_state=42), cv=2, random_state=42)
sbs.fit_transform(X, y)

Example with fit_params not working

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from feature_engine.selection import SelectByShuffling

X = pd.DataFrame(dict(x1 = [1000,2000,1000,1000,2000,3000], x2 = [1000,2000,1000,1000,2000,3000]))
y = pd.Series([1,0,0,1,1,0])

sbs = SelectByShuffling(RandomForestClassifier(random_state=42), cv=2, random_state=42)

sample_weight = [1000,2000,1000,1000,2000,3000]
fit_p = {"sample_weight": sample_weight}

sbs.fit_transform(X, y, **fit_p)

Expected behaviour

I expect the unpacked parameters in fit_params to be passed to the underlying model / classifier.

Error log

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 sbs.fit_transform(X, y, **fit_p)

File ~/Dev/litmus/.env/lib/python3.10/site-packages/sklearn/base.py:870, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    867     return self.fit(X, **fit_params).transform(X)
    868 else:
    869     # fit method of arity 2 (supervised transformation)
--> 870     return self.fit(X, y, **fit_params).transform(X)

TypeError: SelectByShuffling.fit() got an unexpected keyword argument 'sample_weight'


solegalli commented 1 year ago

Thank you!

It is probably not working.

Just to confirm: the sample weights would be passed to the random forest model, and not used by the shuffler.

Is this the intention?

PS: sorry for the delay, I was on a short holiday break :)

markdregan commented 1 year ago

Hi - thanks for the reply.

It is twofold. The intention is for the model to receive the sample weights for the rows that the shuffler feeds to it.

This is a broader problem in sklearn, it seems. I read some PRs that suggest it is supported natively within cross_validate and other CV methods, i.e. the sample weights vector is indexed and the relevant subsets are passed to each CV fold as appropriate.

My thinking for SelectByShuffling is that fit_params would pass sample_weight to the cross-validation step, which natively supports passing the right subset of sample weights to the model when fitting on each CV fold.
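The per-fold indexing described above can be sketched by hand. This is only an illustration of the idea, not the scikit-learn or feature_engine implementation: the helper name fit_per_fold_with_weights, the synthetic data and the choice of StratifiedKFold are all assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def fit_per_fold_with_weights(estimator, X, y, sample_weight, n_splits=3):
    """Hypothetical helper: slice sample_weight with each fold's train indices."""
    sample_weight = np.asarray(sample_weight)
    cv = StratifiedKFold(n_splits=n_splits)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        # Fit a fresh copy of the estimator on this fold's training rows,
        # passing only the weights that belong to those rows.
        model = clone(estimator)
        model.fit(X[train_idx], y[train_idx], sample_weight=sample_weight[train_idx])
        # The score on the held-out rows is left unweighted here.
        scores.append(model.score(X[test_idx], y[test_idx]))
    return scores


X, y = make_classification(n_samples=300, random_state=42)
weights = np.random.default_rng(42).random(len(y))
fold_scores = fit_per_fold_with_weights(
    RandomForestClassifier(n_estimators=20, random_state=42), X, y, weights
)
print(fold_scores)
```

The key point is that the weights vector has one entry per row of X, so it has to be indexed with the same training indices as X and y before each fit.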

markdregan commented 1 year ago

The method _check_fit_params in sklearn.utils.validation does what I describe above. From tracing through the cross_validate source, it looks to be implemented correctly.

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L1899

SelectByShuffling appears to use cross_validate from sklearn, but fit_params are not passed to it: https://github.com/feature-engine/feature_engine/blob/main/feature_engine/selection/shuffle_features.py#L216
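A rough sketch of what forwarding could look like, written as a standalone function rather than as a change to SelectByShuffling. The name cross_validate_with_fit_params is hypothetical and this is not feature_engine code. It also assumes that the keyword is called fit_params in the scikit-learn versions discussed here and params in newer releases, so it inspects the signature instead of hard-coding one name.

```python
import inspect

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate


def cross_validate_with_fit_params(estimator, X, y, cv=3, **fit_params):
    """Hypothetical helper: forward fit-time keyword arguments to cross_validate."""
    # Assumption: older releases accept `fit_params`, newer ones `params`.
    accepted = inspect.signature(cross_validate).parameters
    keyword = "params" if "params" in accepted else "fit_params"
    # cross_validate slices array-like values such as sample_weight per fold.
    return cross_validate(estimator, X, y, cv=cv, **{keyword: fit_params})


X, y = make_classification(n_samples=300, random_state=42)
weights = np.random.default_rng(42).random(len(y))
result = cross_validate_with_fit_params(
    RandomForestClassifier(n_estimators=20, random_state=42),
    X,
    y,
    cv=3,
    sample_weight=weights,
)
print(result["test_score"])
```

A fit method that accepted **fit_params could hand them on in the same way, which is the behaviour the fit_transform call in the bug report expects.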

I can create a reproducible example showing that cross_validate uses sample_weight correctly, if helpful. I will comment back here.

markdregan commented 1 year ago

Example showing how cross_validate accepts sample_weight within fit_params.

import numpy as np
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=8000, random_state=42)
sample_weight = np.random.random(size=len(y))

# Check shapes correct
X.shape, y.shape, sample_weight.shape

fit_params = {"sample_weight": sample_weight}
clf = RandomForestClassifier()

# Without sample weight
cross_validate(estimator=clf, X=X, y=y, cv=3)

# With sample weight
cross_validate(estimator=clf, X=X, y=y, cv=3, fit_params=fit_params)

One thing to note: the fit_params are not passed to the scorer in cross_validate. So passing sample_weight in fit_params only influences training, not the resulting scores. But I think that is OK, as the training part is the most important.
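To make the training versus scoring distinction concrete, here is a tiny hand-made example of how weights would change a score if they were applied at scoring time. The labels and weights are invented for illustration. accuracy_score accepts a sample_weight argument, but as noted above, fit_params alone does not route weights to the scorer.

```python
from sklearn.metrics import accuracy_score

# Invented toy data: the single misclassified row carries double weight.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
weights = [1, 1, 2, 1]

# Unweighted: 3 of 4 rows are correct.
unweighted = accuracy_score(y_true, y_pred)
# Weighted: correct rows carry weight 3 out of a total weight of 5.
weighted = accuracy_score(y_true, y_pred, sample_weight=weights)
print(unweighted, weighted)
```

Here the unweighted accuracy is 0.75 and the weighted accuracy is 0.6, so scores reported without weights can differ from the weighted objective the model was trained on.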

@solegalli - Hope the above is helpful. Looks like passing fit_params to cross_validate will work. But perhaps there are other constraints / issues I'm not aware of.