Add feature importance calculation

Blue-Yonder-OSS / cyclic-boosting

implementation of Cyclic Boosting machine learning algorithms

Eclipse Public License 2.0

87 stars 15 forks source link

Add feature importance calculation #41

Closed lbventura closed 1 year ago

lbventura commented 1 year ago

By including set_feature_bin_weighted_average in CyclicBoostingBase cb_features. I think this is a rather inefficient way of doing this as the feature importances get computed in each iteration of _fit_main (https://github.com/Blue-Yonder-OSS/cyclic-boosting/blob/main/cyclic_boosting/base.py#L691), which is redundant as only the final feature importances are required. It was done because:

set_feature_bin_weighted_average has to be called before feature.unbind_data
it avoids introducing a new call to feature.bind_data(X, weights, False)

Alternatively, one could iterate over the features and 1) bind_data, 2) compute the weighted average, and then 3) unbind_data, for example, here https://github.com/Blue-Yonder-OSS/cyclic-boosting/blob/main/cyclic_boosting/base.py#L724

FelixWick commented 1 year ago

Yeah, we should not do the calculation in each iteration.

As quick implementation, you can go with your proposed alternative, it's not too expensive.

But in fact, we might also do a refactoring, all the unbinding and re-binding in the middle of the fitting is not needed, I think.

lbventura commented 1 year ago

Cool, I'll will follow your suggestion then and try to remove the feature.unbind_data which is currently preventing the computation outside of the loop.

lbventura commented 1 year ago

Just removed unbind_data and refactored.

FelixWick commented 1 year ago

It would be nice to include the estimation of feature importances for one of the integration tests, just as sanity check to see if the method produces reasonable results.

lbventura commented 1 year ago

It would be nice to include the estimation of feature importances for one of the integration tests, just as sanity check to see if the method produces reasonable results.

I am not 100% sure if this is what you are referring to: https://github.com/Blue-Yonder-OSS/cyclic-boosting/pull/41/files#diff-d8c7322bad71404522a489cb47c789fd4e6ec02b09f93563b51c1d599d4d9707R7

See screenshot below with how the results of the test look like: one can identify the features through the FeatureID.feature_group attribute, but this is not super clear as it does not provide information about the original feature name (say, day_of_week). Maybe we can edit FeatureID so that it includes the original feature name?

FelixWick commented 1 year ago

I mean to use the data in the integration test, because it is a quite realistic simulation and we expect the most informative features (which we know for this) to have highest feature importance. To your issue with the feature name, I think you already have access to this. There is just no feature names for the simple test you used. In the integration tests there should be names.

lbventura commented 1 year ago

I mean to use the data in the integration test, because it is a quite realistic simulation and we expect the most informative features (which we know for this) to have highest feature importance. To your issue with the feature name, I think you already have access to this. There is just no feature names for the simple test you used. In the integration tests there should be names.

Both issues have been addressed, I believe