RainFung closed this issue 4 years ago
Thanks. It would be good to add some documentation about it.
Sklearn's cross_validate function (which is used by lofo-importance in LOFOImportance._get_cv_score) has a groups keyword argument. I forked this repo and added it there; you can see it in this PR https://github.com/BartlomiejSkwira/lofo-importance/pull/1 (it's a work in progress and still needs tests).
@aerdem4 would it be a good PR candidate for your repo?
@BartlomiejSkwira GroupKFold is supported with the workaround above. Your PR looks nice, but it only covers one of many validation schemes. From a minimalistic point of view, I think keeping the repo free of special cases may be better. But if you have an idea for supporting the most common validation schemes in a generic way, you are welcome.
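The workaround referred to above is, as I understand it, to precompute the (train, test) splits from GroupKFold and pass them directly as the cv argument instead of passing the splitter object. A minimal sketch with synthetic data (dataset shapes and parameters are illustrative, not from the thread):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_validate

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
groups = np.repeat(np.arange(10), 10)  # 10 groups of 10 samples each

# Materialize the (train_idx, test_idx) pairs and pass them as cv;
# this is what LOFOImportance ends up handing to cross_validate.
splits = list(GroupKFold(n_splits=4).split(X, y, groups))
results = cross_validate(RandomForestClassifier(random_state=0), X, y, cv=splits)
print(len(results["test_score"]))  # one score per fold -> 4
```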
@aerdem4 This workaround didn't work for me; I would get:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
...
<my code calling lofo_importance.get_importance()>
...
File "/opt/conda/lib/python3.8/site-packages/lofo/lofo_importance.py", line 85, in get_importance
lofo_cv_scores.append(self._get_cv_score(feature_to_remove=f))
File "/opt/conda/lib/python3.8/site-packages/lofo/lofo_importance.py", line 59, in _get_cv_score
cv_results = cross_validate(self.model, X, y, cv=self.cv, scoring=self.scoring, fit_params=fit_params, groups=self.groups)
File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 260, in cross_validate
results = _aggregate_score_dicts(results)
File "/opt/conda/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 1675, in _aggregate_score_dicts
for key in scores[0]
IndexError: list index out of range
I wonder if I used it correctly, here is how I used lofo:
pipe = pipeline.Pipeline(steps=[("cls", ensemble.RandomForestClassifier(random_state=RANDOM_STATE))])
cv = model_selection.GroupKFold(n_splits=N_SPLITS)
search = model_selection.GridSearchCV(
    pipe,
    param_grid,
    n_jobs=-1,
    scoring=scoring,
    cv=cv,
    verbose=0,
    refit=True,
)
search.fit(X, y, groups=groups)
dataset = Dataset(
    df=df,
    target="some_target",
    features=attribute_columns,
)
# define the validation scheme and scorer
lofo_importance = LOFOImportance(
    dataset,
    cv=cv.split(X, y, groups),
    scoring=scoring,
    model=search.best_estimator_,
    n_jobs=n_jobs,
    # groups=groups,
)
# get the mean and standard deviation of the importances in pandas format
importance_df = lofo_importance.get_importance()  # this line throws an exception
Can you check the length of the list generated by cv.split just before feeding it to LOFO? The functions you call beforehand can consume the split generator, so cv.split may yield an empty list.
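This single-use behavior is easy to demonstrate: split() returns a generator, and anything that iterates over it once leaves nothing for LOFO to use afterwards (a minimal sketch, not code from the thread):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((8, 2))
y = np.zeros(8)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

gen = GroupKFold(n_splits=4).split(X, y, groups)
first_pass = list(gen)   # consumes the generator: 4 folds
second_pass = list(gen)  # already exhausted: empty

print(len(first_pass), len(second_pass))  # 4 0
```

An empty cv is exactly what makes cross_validate fail with errors like "list index out of range" or "not enough values to unpack".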
@aerdem4 I get this error when using GroupKFold. In cv_results = cross_validate(self.model, X, y, cv=self.cv, scoring=self.scoring, fit_params=fit_params):
ValueError: not enough values to unpack (expected 3, got 0)
@graceyangfan How are you using GroupKFold? In the way I suggested? Can you check the input or share reproducible code?
Getting the same error as Grace.
lofo_imp = LOFOImportance(dataset, cv=GroupKFold(n_splits=4).split(X=tr, y=tr['pressure'], groups=tr['breath_id']), scoring="neg_mean_absolute_error")
ValueError: not enough values to unpack (expected 3, got 0)
The new sklearn version seems to have problems with iterables in cross_validate. Converting the iterable to a list is a workaround:
lofo_imp = LOFOImportance(dataset, cv=list(GroupKFold(n_splits=4).split(X=tr, y=tr['pressure'], groups=tr['breath_id'])), scoring="neg_mean_absolute_error")
You can actually provide any iterator of (train_index, test_index) pairs to the cv parameter. sklearn's cross_validate function accepts both KFold objects and iterators (the output of a KFold object's split) as input. An example would be:
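As a hedged illustration of that point (the data and splits here are made up, not from the thread), even a hand-built list of (train_index, test_index) pairs works as cv:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X = np.random.RandomState(0).randn(30, 3)
y = np.arange(30) % 2  # alternating labels so every fold sees both classes

# Any iterable of (train_index, test_index) pairs is a valid cv,
# e.g. two custom holdout splits defined by hand.
custom_splits = [
    (np.concatenate([np.arange(0, 10), np.arange(20, 30)]), np.arange(10, 20)),
    (np.arange(10, 30), np.arange(0, 10)),
]
results = cross_validate(LogisticRegression(), X, y, cv=custom_splits)
print(len(results["test_score"]))  # one score per custom split -> 2
```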