CDonnerer / xgboost-distribution

Probabilistic prediction with XGBoost.
MIT License

Compatibility with scikit-learn API #59

Open kmedved opened 3 years ago

kmedved commented 3 years ago

Hello - thanks again for the wonderful package.

I wanted to ask whether it would make sense to adjust the current .predict API to mimic NGBoost's behaviour of returning point predictions from .predict(), while relying on .pred_dist() to return information about the distribution.

The advantage of this is mostly to increase compatibility with the rest of the scikit-learn ecosystem for the purposes of hyperparameter tuning and other testing. Right now, it's difficult to integrate xgboost-distribution with those tools because the .predict() call returns the point predictions and the distribution information together (e.g. the parameters of a normal distribution) rather than point predictions alone.

This seems like a simple change, but I wanted to get your thoughts. Thanks.

CDonnerer commented 2 years ago

Hi, thanks for raising this. My initial thinking here was that returning point estimates from .predict() gives the impression that the estimator is just a normal regressor, which I wanted to avoid. However, I do see the appeal of being able to fit into the scikit-learn ecosystem. The "correct" way of tuning hyperparameters should probably use the negative log likelihood, but maybe there's an argument for being able to use something like RMSE. I'll have a look into this!

In the meantime, you could get to the above API by doing something like:

class XGBDistributionMean(XGBDistribution):
    def predict(self, *args, **kwargs):
        # Return only the point estimate: the mean (loc) of the
        # fitted distribution.
        preds = super().predict(*args, **kwargs)
        return preds.loc

    def predict_distribution(self, *args, **kwargs):
        # Expose the full distribution parameters under a separate name.
        return super().predict(*args, **kwargs)

which should be fully compatible with scikit-learn.
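To illustrate the pattern without installing the package, here is a self-contained sketch: `FakeXGBDistribution` and `NormalParams` are hypothetical stand-ins for `XGBDistribution` and its namedtuple of distribution parameters, used only to show that the wrapper's `predict` now returns a plain array of point estimates that scikit-learn scorers can consume:

```python
from collections import namedtuple

# Stand-in for the (loc, scale) parameters returned by predict()
# in the real library (hypothetical, for illustration only).
NormalParams = namedtuple("NormalParams", ["loc", "scale"])


class FakeXGBDistribution:
    """Hypothetical base class mimicking XGBDistribution's predict()."""

    def predict(self, X):
        # Pretend the fitted model predicts mean 2*x with unit scale.
        return NormalParams(loc=[2 * x for x in X], scale=[1.0] * len(X))


class FakeXGBDistributionMean(FakeXGBDistribution):
    def predict(self, *args, **kwargs):
        # Point predictions only: the mean (loc) of the distribution.
        return super().predict(*args, **kwargs).loc

    def predict_distribution(self, *args, **kwargs):
        # Full distribution parameters under a separate method name.
        return super().predict(*args, **kwargs)


model = FakeXGBDistributionMean()
print(model.predict([1, 2, 3]))                      # [2, 4, 6]
print(model.predict_distribution([1, 2, 3]).scale)   # [1.0, 1.0, 1.0]
```

Because `predict` now returns a flat sequence of point estimates, tools that expect a standard regressor interface (metrics, cross-validation, hyperparameter search) can score the wrapped model directly.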