
[tabular] Allow feature importance calculation during fitting when `num_bag_folds` >= 2 #4293


dluks commented 6 days ago

Description

When fitting bagged models with `num_bag_folds` >= 2, it may be redundant to set aside an additional validation set just to calculate feature importance, since the held-out folds already provide validation data.

Current workflow

  1. Set aside a separate validation set (`val`) from the training data (`train`)
  2. `TabularPredictor.fit(train, num_bag_folds=10)`
  3. Calculate feature importance with `TabularPredictor.feature_importance(val)`
  4. Refit the full model with the extra `val` data: `TabularPredictor.refit_full(train_data_extra=val)`

For large datasets, this can result in unnecessarily long training times, both because of the multiple fitting steps and because it does not take advantage of the built-in validation data in the form of the held-out folds. Additionally, it separates the feature importance calculation from the cross-validation process.
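For concreteness, a minimal sketch of this workflow (the DataFrame `df`, its columns, and the `"label"` target are hypothetical names; `refit_full(train_data_extra=...)` is the call from step 4):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularPredictor

# Toy data standing in for the user's dataset (hypothetical names).
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)]).assign(label=y)

# Step 1: carve out a separate validation set, even though bagging will
# already hold out folds internally.
train, val = train_test_split(df, test_size=0.2, random_state=0)

# Step 2: fit bagged models on `train` only.
predictor = TabularPredictor(label="label").fit(train, num_bag_folds=10)

# Step 3: feature importance requires the separate validation set.
fi = predictor.feature_importance(val)

# Step 4: fold the validation data back in with an extra refit pass.
predictor.refit_full(train_data_extra=val)
```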

Requested workflow

When calling `TabularPredictor.fit()`, add a `feature_importance` flag which, when set in combination with `num_bag_folds` >= 2, automatically calculates feature importance on each held-out fold during training.

  1. `TabularPredictor.fit(all_data, num_bag_folds=10, feature_importance=True, refit_full=True)`

Semi-related to #3515

Innixma commented 4 days ago

Hi @dluks, thanks for creating an issue!

I actually really wanted to add support for this in AutoGluon when I first implemented the feature importance logic, but the more I thought about it, the more I realized that several key issues arise when trying to make it work with multi-layer stacking.

  1. To calculate the importance of a feature in a stacker model, you first need out-of-fold predictions from the base models with that specific feature shuffled in the same way. You would need to retrieve the specific held-out fold indices for each fold model, predict only on those indices, and then concatenate the results together (see the sketch after this list). This is complicated, but technically doable with the correct implementation, albeit slower to compute than the holdout method.
  2. However, for performance optimization reasons, we fit only one model for the KNN and random forest families and use internal implementation details to approximate the out-of-fold prediction probabilities (efficient leave-one-out for KNN, out-of-bag for RF). Doing (1) for these models would be very complicated, especially for RF, and would require model-specific code. This is the main blocker in my mind, since there is no elegant model-agnostic solution.
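As an illustration of point (1) for a single bagged layer, here is a minimal sketch with plain scikit-learn models standing in for AutoGluon's fold models (the `fold_models`/`fold_idx` structure is an assumption, not AutoGluon's internals). It shuffles one feature within each held-out fold and concatenates the per-fold predictions back into out-of-fold order; for a stacker model, the shuffled out-of-fold predictions of the base models would additionally need to be regenerated as its inputs first.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Toy stand-in for a bagged ensemble: one fitted model plus one array of
# held-out row indices per fold.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
fold_models, fold_idx = [], []
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    fold_models.append(RandomForestClassifier(random_state=0).fit(X[tr], y[tr]))
    fold_idx.append(va)

def oof_permutation_importance(feature: int) -> float:
    """Drop in out-of-fold accuracy when `feature` is shuffled within each
    held-out fold, with per-fold predictions concatenated into OOF order."""
    rng = np.random.default_rng(0)
    base = np.empty(len(X), dtype=int)
    shuf = np.empty(len(X), dtype=int)
    for model, idx in zip(fold_models, fold_idx):
        base[idx] = model.predict(X[idx])
        X_perm = X[idx].copy()
        X_perm[:, feature] = rng.permutation(X_perm[:, feature])
        shuf[idx] = model.predict(X_perm)
    return accuracy_score(y, base) - accuracy_score(y, shuf)

print([round(oof_permutation_importance(f), 3) for f in range(3)])
```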

We could partially avoid these issues by not re-calculating the base model out-of-fold predictions when computing feature importance with val data, but then the feature importances would technically be incorrect and potentially misleading, since the base model predictions would be computed on the original, unshuffled version of the feature. I'd prefer to avoid that.
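For concreteness on point (2), the out-of-bag shortcut for random forests corresponds to something like scikit-learn's `oob_decision_function_` (a sketch of the idea, not AutoGluon's actual internals): the out-of-fold-style probabilities fall out of a single fitted forest, so there are no per-fold models that could simply be re-run on a shuffled copy of the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# A single forest: its out-of-bag machinery yields OOF-style probabilities
# directly, with no per-fold models to re-run on shuffled inputs.
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0).fit(X, y)
oof_proba = rf.oob_decision_function_  # shape: (n_samples, n_classes)
```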

tl;dr: I believe it is technically possible to do this, but requires a good deal of development effort to achieve.