aerdem4 / lofo-importance

Leave One Feature Out Importance
MIT License
816 stars 84 forks

How to use GroupKFold? #28

Closed RainFung closed 4 years ago

aerdem4 commented 4 years ago

You can actually provide any (train_index, test_index) iterator to the cv parameter. sklearn's cross_validate function accepts both KFold objects and iterators (the output of a KFold object's split method) as inputs. An example would be:

lofo_imp = LOFOImportance(dataset, cv=GroupKFold(4).split(X, y, groups), scoring="roc_auc")
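
To make the (train_index, test_index) pairs concrete, here is a minimal sketch of what GroupKFold(...).split(...) yields. The toy arrays X, y and groups are made up for illustration and are not from this thread.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (hypothetical): 12 rows belonging to 4 groups of 3 rows each.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)

# split() returns a lazy generator of (train_index, test_index) pairs.
splits = GroupKFold(n_splits=4).split(X, y, groups)

# Materialize the generator so the folds can be inspected.
folds = list(splits)
for train_index, test_index in folds:
    # No group appears on both the train and the test side of a fold.
    assert set(groups[train_index]).isdisjoint(groups[test_index])
    print(len(train_index), len(test_index))
```

Each pair holds integer row positions, so the same pairs only make sense for data with the same row order as the X they were generated from.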

RainFung commented 4 years ago

Thanks. It would be good to add some documentation about this.

BartlomiejSkwira commented 3 years ago

Sklearn's cross_validate function (which is used by lofo-importance in LOFOImportance._get_cv_score) has a groups keyword argument. I forked this repo and added it there. You can see it in this PR https://github.com/BartlomiejSkwira/lofo-importance/pull/1 (it's a work in progress and still requires tests).

@aerdem4 would it be a good PR candidate for your repo?
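
For reference, this is roughly how the groups keyword of cross_validate is used in plain sklearn, outside of lofo. It is a minimal sketch on made-up data; the model and array shapes are assumptions chosen only to keep the example small, and it does not show the code in the PR.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_validate

# Toy data (hypothetical): 40 rows, 8 groups of 5 consecutive rows each.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.tile([0, 1], 20)              # alternating labels, both classes in every group
groups = np.repeat(np.arange(8), 5)

# Pass the splitter object itself and let cross_validate forward the groups.
cv_results = cross_validate(
    LogisticRegression(),
    X,
    y,
    groups=groups,
    cv=GroupKFold(n_splits=4),
    scoring="roc_auc",
)
print(len(cv_results["test_score"]))  # one score per fold
```

Because the splitter object is passed instead of the output of split, the folds are regenerated on every call.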

aerdem4 commented 3 years ago

@BartlomiejSkwira GroupKFold is supported with the workaround above. Your PR looks nice, but it only covers one of many validation schemes. From a minimalistic point of view, I think it may be better to keep the repo free of special cases. But if you have an idea for including the most common validation schemes in a generic way, you are welcome.

BartlomiejSkwira commented 3 years ago

@aerdem4 This workaround didn't work for me. I get the following error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
    ...
<my code calling lofo_importance.get_importance()>
   ...
  File "/opt/conda/lib/python3.8/site-packages/lofo/lofo_importance.py", line 85, in get_importance
    lofo_cv_scores.append(self._get_cv_score(feature_to_remove=f))
  File "/opt/conda/lib/python3.8/site-packages/lofo/lofo_importance.py", line 59, in _get_cv_score
    cv_results = cross_validate(self.model, X, y, cv=self.cv, scoring=self.scoring, fit_params=fit_params, groups=self.groups)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 260, in cross_validate
    results = _aggregate_score_dicts(results)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 1675, in _aggregate_score_dicts
    for key in scores[0]
IndexError: list index out of range

I wonder if I used it correctly. Here is how I used lofo:

pipe = pipeline.Pipeline(steps=[("cls", ensemble.RandomForestClassifier(random_state=RANDOM_STATE))])
cv = model_selection.GroupKFold(n_splits=N_SPLITS)
search = model_selection.GridSearchCV(
    pipe,
    param_grid,
    n_jobs=-1,
    scoring=scoring,
    cv=cv,
    verbose=0,
    refit=True,
)
search.fit(X, y, groups=groups)
dataset = Dataset(
        df=df,
        target="some_target",
        features=attribute_columns,
)

# define the validation scheme and scorer.
lofo_importance = LOFOImportance(
    dataset,
    cv=cv.split(X, y, groups),
    scoring=scoring,
    model=search.best_estimator_,
    n_jobs=n_jobs,
    # groups=groups,
)

# get the mean and standard deviation of the importances in pandas format
importance_df = lofo_importance.get_importance()  # this line throws an exception

aerdem4 commented 3 years ago

Can you check the length of the list generated by cv.split just before feeding it to LOFO? The functions you call beforehand may mutate cv, in which case cv.split may return an empty list.
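
One way to run this check is sketched below on made-up data. It also shows one way an empty list can arise: split() returns a one-shot generator, so consuming it a second time yields nothing. Whether that is what happened in the code above is an assumption, not something confirmed in this thread.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (hypothetical): 12 rows in 4 groups.
X = np.zeros((12, 2))
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)

splits = GroupKFold(n_splits=4).split(X, y, groups)

first_pass = list(splits)   # consumes the generator
second_pass = list(splits)  # the generator is already exhausted

print(len(first_pass), len(second_pass))  # prints "4 0"
```

Note that the check itself consumes the generator, so a fresh call to split (or the list built here) has to be passed on afterwards.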

graceyangfan commented 3 years ago

@aerdem4 I get this error when using GroupKFold, in cv_results = cross_validate(self.model, X, y, cv=self.cv, scoring=self.scoring, fit_params=fit_params):

ValueError: not enough values to unpack (expected 3, got 0)

aerdem4 commented 3 years ago

@graceyangfan How are you using GroupKFold? The way I suggested? Can you check the input or share reproducible code?

Quetzalcohuatl commented 3 years ago

Getting the same error as Grace.

lofo_imp = LOFOImportance(dataset, cv=GroupKFold(n_splits=4).split(X=tr, y=tr['pressure'], groups=tr['breath_id']), scoring="neg_mean_absolute_error")

ValueError: not enough values to unpack (expected 3, got 0)

aerdem4 commented 3 years ago

The new sklearn version seems to have problems with iterables passed to cross_validate. Converting the iterable to a list is a workaround:

lofo_imp = LOFOImportance(dataset, cv=list(GroupKFold(n_splits=4).split(X=tr, y=tr['pressure'], groups=tr['breath_id'])), scoring="neg_mean_absolute_error")
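
A minimal sketch of why a list helps: a list of (train_index, test_index) pairs can be reused across any number of cross_validate calls. The loop below drops one column at a time as a stand-in for repeated scoring; it is an illustration on made-up data, not lofo's actual implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_validate

# Toy regression data (hypothetical): 40 rows, 8 groups of 5 rows each.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)
groups = np.repeat(np.arange(8), 5)

# Materialize the folds once so they survive repeated use.
cv_list = list(GroupKFold(n_splits=4).split(X, y, groups))

fold_counts = []
for col in range(X.shape[1]):
    # Remove one feature and rescore on the same folds.
    X_reduced = np.delete(X, col, axis=1)
    res = cross_validate(
        LinearRegression(),
        X_reduced,
        y,
        cv=cv_list,
        scoring="neg_mean_absolute_error",
    )
    fold_counts.append(len(res["test_score"]))

print(fold_counts)  # every call sees all 4 folds
```

The index pairs refer to row positions, so they stay valid as long as rows are not reordered or dropped between calls.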