Closed markdregan closed 1 year ago
Thank you!
It is probably not working.
Just to confirm: the sample weights would be passed to the random forest model, and not used by the shuffler.
Is this the intention?
PS: sorry for the delay, I was on a short holiday break :)
Hi - thanks for the reply.
It is twofold. The intention is for the model to have sample weights for the rows that are fed to it from the shuffler.
This is a broader problem in sklearn, it seems. I read some PRs that suggest it is supported natively within cross_validate and other CV methods, i.e. the sample weights vector is indexed and the relevant subset is passed to each CV fold as appropriate.
My thinking for SelectByShuffling is that fit_params would pass sample_weights to the CV shuffler, which would natively support passing the right subsets of sample weights to the model when fitting each CV fold.
The method _check_fit_params in sklearn.utils.validation does what I describe above. And from tracing through the cross_validate source, it looks to be implemented correctly.
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L1899
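To make the per-fold indexing concrete, here is a minimal sketch of the same idea written by hand. It is not sklearn's internal code: the loop, the plain KFold splitter, and the names fold_scores and fold_weight_sizes are my own assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, random_state=42)
sample_weight = np.random.default_rng(0).random(len(y))

clf = RandomForestClassifier(n_estimators=20, random_state=0)
fold_scores = []
fold_weight_sizes = []

# KFold is an assumption made for simplicity; cross_validate picks its own splitter.
for train_idx, test_idx in KFold(n_splits=3).split(X, y):
    model = clone(clf)
    # Index the weights with the same training indices used for X and y.
    w_train = sample_weight[train_idx]
    model.fit(X[train_idx], y[train_idx], sample_weight=w_train)
    fold_weight_sizes.append(len(w_train))
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
```

Each fold only ever sees the weights that belong to its own training rows, which is the behaviour I would expect fit_params to give for free.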
For SelectByShuffling, it looks like cross_validate from sklearn is used, but fit_params are not passed: https://github.com/feature-engine/feature_engine/blob/main/feature_engine/selection/shuffle_features.py#L216
I can create a reproducible example showing cross_validate using sample_weights correctly if helpful - will comment back here.
Example showing how cross_validate accepts sample_weight within fit_params.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=8000, random_state=42)
sample_weight = np.random.random(size=len(y))
# Check shapes correct
X.shape, y.shape, sample_weight.shape
fit_params = {"sample_weight": sample_weight}
clf = RandomForestClassifier()
# Without sample weight
cross_validate(estimator=clf, X=X, y=y, cv=3)
# With sample weight
cross_validate(estimator=clf, X=X, y=y, cv=3, fit_params=fit_params)
One thing to note - the fit_params are not passed to the scorer in cross_validate. So passing sample_weight in fit_params only influences training, and not the respective scores. But I think that is ok, as the training part is the most important.
@solegalli - Hope the above is helpful. It looks like passing fit_params to cross_validate will work. But perhaps there are other constraints / issues I'm not aware of.
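For illustration, a rough sketch of what forwarding the fit parameters might look like in a shuffle-based importance routine. This is a simplified stand-in and not feature_engine's actual code: shuffle_importance and its arguments are hypothetical. As far as I know, newer scikit-learn releases name the cross_validate argument params rather than fit_params, so the sketch checks the signature instead of assuming either.

```python
import inspect

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate


def shuffle_importance(estimator, X, y, cv=3, fit_params=None, random_state=0):
    """Hypothetical helper: score drop per column after shuffling that column."""
    # Pick whichever keyword this scikit-learn version accepts for fit parameters.
    accepted = inspect.signature(cross_validate).parameters
    key = "params" if "params" in accepted else "fit_params"
    result = cross_validate(
        estimator, X, y, cv=cv, return_estimator=True, **{key: fit_params}
    )
    baseline = result["test_score"].mean()

    rng = np.random.default_rng(random_state)
    drifts = []
    for col in range(X.shape[1]):
        X_shuffled = X.copy()
        # Break the link between this column and the target.
        X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
        shuffled_score = np.mean(
            [m.score(X_shuffled, y) for m in result["estimator"]]
        )
        drifts.append(baseline - shuffled_score)
    return np.array(drifts)


X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
sample_weight = np.random.default_rng(0).random(len(y))
clf = RandomForestClassifier(n_estimators=20, random_state=0)
drifts = shuffle_importance(clf, X, y, fit_params={"sample_weight": sample_weight})
```

The only change relative to an unweighted version is the extra keyword handed to cross_validate; the shuffling step itself never touches the weights.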
Describe the bug
The documentation suggests that fit_params is supported for SelectByShuffling, but my tests suggest it is not. I have a working example below. Let me know if I am using it incorrectly or if I can do any more tests to help.
Working example without fit_params
Example with fit_params not working
Expected behaviour
I expect the unpacked params within fit_params to be passed to the underlying model / classifier.
Error log
Desktop (please complete the following information):