Open dluks opened 6 days ago
Hi @dluks, thanks for creating an issue!
I actually really wanted to add support for this in AutoGluon when I first implemented the feature importance logic, but the more I thought about it, the more I realized several key issues with making it work with multi-layer stacking.
We could partially avoid these issues by not re-calculating the base models' out-of-fold predictions when computing feature importance with val data, but then the feature importances would be technically incorrect and potentially misleading, which I'd prefer to avoid (since the base model predictions would be computed on the original, unshuffled version of the feature).
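To see why that shortcut understates importance, here is a toy two-layer stack in plain numpy (a minimal sketch; the linear "base model" and "stacker" are illustrative stand-ins, not AutoGluon internals). Shuffling a feature only at the stacker input, while reusing base-model predictions computed on the unshuffled data, leaves most of the feature's signal path intact:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=n)

# "Base model": ordinary least squares on the raw features.
w_base, *_ = np.linalg.lstsq(X, y, rcond=None)

# "Stacker": least squares on [raw features, base-model prediction].
Z = np.column_stack([X, X @ w_base])
w_stack, *_ = np.linalg.lstsq(Z, y, rcond=None)

def mse(pred):
    return float(np.mean((y - pred) ** 2))

baseline = mse(Z @ w_stack)

# Correct permutation importance: shuffle feature 0, then recompute the
# base-model prediction on the shuffled data before feeding the stacker.
Xs = X.copy()
Xs[:, 0] = rng.permutation(Xs[:, 0])
correct = mse(np.column_stack([Xs, Xs @ w_base]) @ w_stack) - baseline

# Shortcut: shuffle feature 0 only at the stacker input, reusing the
# base-model prediction computed on the unshuffled data.
shortcut = mse(np.column_stack([Xs[:, 0], X[:, 1], X @ w_base]) @ w_stack) - baseline

print(correct, shortcut)  # the shortcut badly understates the importance
```

Because most of feature 0's contribution reaches the stacker through the base-model prediction, the shortcut importance is a small fraction of the correct one.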
tl;dr: I believe it is technically possible to do this, but requires a good deal of development effort to achieve.
Description
When fitting bagged models with >= 2 folds, it may be redundant to set aside an additional set of validation data in order to calculate feature importance.
Current workflow
1. Split validation data (`val`) from train data (`train`)
2. `TabularPredictor.fit(train, num_bag_folds=10)`
3. `TabularPredictor.feature_importance(val)`
4. (Optional) Re-fit using the `val` data: `TabularPredictor.refit_full(train_data_extra=val)`
For large datasets this can result in unnecessarily long training times, both because of the multiple fitting steps and because the built-in validation data already available in the held-out folds goes unused. Additionally, this separates the feature importance calculation from the cross-validation process.
Requested workflow
When calling `TabularPredictor.fit()`, include a `feature_importance` flag which, when set in combination with `num_bag_folds >= 2`, will automatically calculate feature importance on each held-out fold during training:

`TabularPredictor.fit(all_data, num_bag_folds=10, feature_importance=True, refit_full=True)`
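The requested behavior can be sketched outside AutoGluon in plain numpy (a hypothetical illustration: the linear fold model and manual k-fold split are stand-ins for AutoGluon's bagged models, not its actual implementation). Each fold's model scores its own held-out fold with and without each feature permuted, and the score drops are averaged across folds:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5
X = rng.normal(size=(n, 3))
# Feature 0 matters most, feature 1 a little, feature 2 is pure noise.
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

def mse(y_true, pred):
    return float(np.mean((y_true - pred) ** 2))

folds = np.array_split(rng.permutation(n), k)
importance = np.zeros(X.shape[1])

for holdout in folds:
    train = np.setdiff1d(np.arange(n), holdout)
    # Fold model: ordinary least squares fit on the other k-1 folds.
    w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    base_err = mse(y[holdout], X[holdout] @ w)
    for j in range(X.shape[1]):
        Xs = X[holdout].copy()
        Xs[:, j] = rng.permutation(Xs[:, j])
        # Permutation importance: error increase on this held-out fold,
        # averaged across the k folds.
        importance[j] += (mse(y[holdout], Xs @ w) - base_err) / k

print(importance)
```

No separate validation set is needed: every row contributes to feature importance exactly once, via the fold that held it out.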
Semi-related to #3515