elephaint / pgbm

Probabilistic Gradient Boosting Machines
Apache License 2.0

Reliability for the probabilistic forecasting models #26

Open YunBAI-PSL opened 4 months ago

YunBAI-PSL commented 4 months ago

Dear authors,

This is very nice work! I got it running on my laptop successfully. There is one thing I am a little confused about.

In example 14, in addition to the code the authors provide, I experimented with plotting the reliability lines for the PGBM and Quantile Gradient Boosting models. Here are the two reliability plots:

[reliability plots for PGBM and Quantile Gradient Boosting]

The ideal reliability curve is a straight line, but in the figures I plotted, the predicted probability and observed frequency do not always match. Does this mean that the probabilistic forecasts are not so reliable? It would be great if the authors could offer an explanation.

My code is as follows:

```python
# Plot reliability for both PGBM and quantile gradient boosting
y_pred_quantiles_dict = dict()
scatter_points = []
y_dist = all_models["mse"].sample(y_mean, y_std, n_estimates=10_000)
for q in quantiles:
    y_pred = np.quantile(y_dist, q, axis=0)
    y_pred_qr = all_models["q " + "{:.2f}".format(q)].predict(xx)
    observed_freq = np.sum(y <= y_pred) / len(y)
    scatter_points.append((q, observed_freq))

# Sort by quantile level (note: key=lambda x: x[0], not key=lambda x: q)
scatter_points.sort(key=lambda x: x[0])

plt.figure(figsize=(10, 6))
x_values, y_values = zip(*scatter_points)
plt.plot(x_values, y_values, marker='o', color='red',
         label='Probabilistic Gradient Boosting')
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.xlabel('Predicted Probability')
plt.ylabel('Observed Frequency')
plt.title('Reliability - Probabilistic Gradient Boosting')
plt.legend()
plt.show()
```

elephaint commented 4 months ago

Hi,

Thanks for your kind words and for using the package.

In practice, it is nearly always the case that the observed frequency does not match entirely with the predicted frequency on most datasets (that's because it's a model and thus an approximation of the underlying process). Personally I think the curve for PGBM looks pretty reasonable and slightly better than the Quantile method for this example.

To improve performance, there are ways of 'calibrating' the predicted frequency to the observed frequency when using PGBM.

Another method that may give you better calibrated predictions is conformalized quantile regression. In that case, you create separate models for every quantile, but the quantile scores are calibrated on a held-out validation set. I think this may give even better results than PGBM, but (again) at the cost of requiring many models (one for every quantile), as in the case of quantile gradient boosting, plus some additional computational cost (for calculating quantile scores on the held-out validation set). I didn't include that method in the example (yet). Note that for time series problems, these conformal predictors don't necessarily lead to better predicted distributions (a link to the work that shows this may follow later; a colleague has worked on this but I don't think it's been published yet).
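For concreteness, a rough sketch of what conformalized quantile regression looks like for a single central 90% interval. This is just a sketch, not part of this package: the split names `X_train`, `y_train`, `X_cal`, `y_cal`, `X_test` are hypothetical, and I'm using scikit-learn's quantile loss for illustration.

```python
# Sketch of conformalized quantile regression (CQR); assumes train /
# calibration / test splits already exist.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

alpha = 0.1  # target a 90% central interval
lo = HistGradientBoostingRegressor(loss="quantile", quantile=alpha / 2)
hi = HistGradientBoostingRegressor(loss="quantile", quantile=1 - alpha / 2)
lo.fit(X_train, y_train)
hi.fit(X_train, y_train)

# Conformity scores on the held-out calibration set: how far each
# observation falls outside the predicted interval.
scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))

# Finite-sample corrected quantile of the scores (method= needs numpy >= 1.22).
n = len(y_cal)
q_hat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Calibrated interval on the test set.
lower = lo.predict(X_test) - q_hat
upper = hi.predict(X_test) + q_hat
```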

Hope this helps, let me know if you have any further questions.

YunBAI-PSL commented 4 months ago

Hi,

Thanks for this nice reply! I also calculated the spread-skill ratio (SSR), i.e. the standard deviation of the ensemble members divided by the RMSE of the ensemble mean, and the result is 0.27 < 1. This means the spread of the ensemble members is not large enough. So it might be better to use CRPS as a loss function during training. But it seems I cannot use the customized loss functions in the code. I look forward to you updating this version. :)
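For reference, the calculation is roughly the following (assuming `y_dist` is the `(n_estimates, n_samples)` array of draws from the code above and `y` the observations):

```python
import numpy as np

# Spread: average standard deviation across ensemble members per observation.
spread = y_dist.std(axis=0).mean()

# Skill: RMSE of the ensemble mean against the observations.
skill = np.sqrt(np.mean((y_dist.mean(axis=0) - y) ** 2))

ssr = spread / skill  # ~1 is well-dispersed; 0.27 indicates underdispersion
```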

Thank you again for the reply!

elephaint commented 4 months ago

Ok - what do you mean by:

> But it seems I cannot use the customized loss functions in the code. I look forward to you updating this version. :)

What is the issue?

YunBAI-PSL commented 4 months ago

I want to replace "squared_error" with a CRPS loss I defined, but I get the error: `InvalidParameterError: The 'loss' parameter of HistGradientBoostingRegressor must be a str among {'poisson', 'squared_error', 'gamma', 'quantile', 'absolute_error'} or an instance of 'sklearn._loss.loss.BaseLoss'. Got <function crps_loss at 0x14d13fca0> instead.` I guess using CRPS may improve the reliability. :)

```python
gbr_ls = HistGradientBoostingRegressor(
    loss="squared_error",
    **common_params
)
```

elephaint commented 4 months ago

Yeah, that's not possible (in general): you can't directly optimize the (ensemble) CRPS, because you can't compute its gradient. To optimize the CRPS, I'd advise doing it in a hyperparameter optimization loop, such as detailed in this example.
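A rough sketch of what I mean (not this package's API: `fit_and_sample` is a hypothetical helper that trains a model with a given setting and returns an `(n_members, n_samples)` array of draws for the validation set, and `candidate_settings` is an assumed list of hyperparameter configurations):

```python
import numpy as np

def crps_ensemble(samples, y):
    """Empirical CRPS of an ensemble: E|X - y| - 0.5 * E|X - X'|, averaged
    over all observations. samples: (n_members, n_samples), y: (n_samples,).
    Keep n_members modest (e.g. 100): memory is O(n_members**2 * n_samples)."""
    term1 = np.abs(samples - y).mean(axis=0)
    term2 = np.abs(samples[:, None, :] - samples[None, :, :]).mean(axis=(0, 1))
    return np.mean(term1 - 0.5 * term2)

# Pick the hyperparameter setting with the lowest validation CRPS.
best_crps, best_setting = float("inf"), None
for setting in candidate_settings:
    samples = fit_and_sample(setting, X_train, y_train, X_val)
    score = crps_ensemble(samples, y_val)
    if score < best_crps:
        best_crps, best_setting = score, setting
```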

But interested to hear if you think otherwise!

valeman commented 3 months ago

Puzzling to see such methods - it is proven that they don't work, as they 1) don't have any mathematical guarantees of validity and 2) rely on unrealistic assumptions and parametric methods.

Relying on normality assumptions and parametric methods cannot produce good prediction intervals. This is a fact, both mathematical and empirical.

I disagree with this statement -> 'In practice, it is nearly always the case that the observed frequency does not match entirely with the predicted frequency on most datasets (that's because it's a model and thus an approximation of the underlying process). Personally I think the curve for PGBM looks pretty reasonable and slightly better than the Quantile method for this example.'

Such a statement is 1) incorrect and 2) disingenuous, as alternatives such as conformal prediction work perfectly well not only empirically but also have theoretical guarantees of validity! So the statement 'it is nearly always the case that the observed frequency does not match entirely with the predicted frequency on most datasets' is, well, FALSE.

Of course, if one models uncertainty using incorrect assumptions such as normality and other parametric assumptions, things will never work well, whether for this model or for alternative models such as NGBoost.

https://medium.com/@valeman/does-ngboost-work-evaluating-ngboost-against-critical-criteria-for-good-probabilistic-prediction-28c4871c1bab

A much better alternative https://github.com/valeman/awesome-conformal-prediction including Conformalized Predictive Distributions that produce the whole well calibrated CDF by default.

https://valeman.medium.com/how-to-predict-quantiles-in-a-more-intelligent-way-or-bye-bye-quantile-regression-hello-24a65e4c50f

https://valeman.medium.com/how-to-predict-full-probability-distribution-using-machine-learning-conformal-predictive-f8f4d805e420

elephaint commented 3 months ago

@valeman

1) I actually recommended conformal prediction to this user for this problem, as I agree it can produce better intervals (not in every case, but definitely for problems where the distributions of train and test remain more or less the same). Happy to discuss if you think otherwise.

2) My statement isn't false: in practice, test distributions very often do not follow train/validation distributions exactly, so yes, the observed frequencies on the test set will generally differ from the predicted quantiles (even when a calibration set has been used!). For time series this is nearly always the case. It's simply a result of train-test shift. No method - not even conformal prediction methods - can correct for that.

3) The statements you make hurt and feel like you are personally attacking me. I am not disingenuous. I am open to an honest discussion, but please refrain from assuming that I am being disingenuous. I don't really understand what makes you engage with me like this (opening a separate non-issue, responding here like this; I don't get it). What did I do to deserve such treatment?

valeman commented 3 months ago

Nothing personal here, I just think the statement that was made is fundamentally flawed.

Of course if the data is changing completely it is more difficult, but this is not what the OP wrote about.

The method proposed can't even handle IID data well, which is very clear from the plots. It just strikes me as odd that people would make statements in research papers, only for the reality to turn out to be vastly different.

Even then, conformal prediction for nonexchangeable data works very well, including for time series, as I am sure you are familiar with? https://arxiv.org/pdf/2202.13415

But this exchange is not about time series or non-exchangeable data. It is about methods that don't work, while the authors claim in their papers that they are SOTA and things like that.

This is from your paper: "We empirically demonstrate the advantages of PGBM compared to existing state-of-the-art methods: (i) PGBM enables probabilistic estimates without compromising on point performance in a single model, (ii) PGBM learns probabilistic estimates via a single model only (and without requiring multi-parameter boosting), and thereby offers a speedup of up to several orders of magnitude over existing state-of-the-art methods on large datasets, and (iii) PGBM achieves accurate probabilistic estimates in tasks with complex differentiable loss functions, such as hierarchical time series problems, where we observed up to 10% improvement in point forecasting performance and up to 300% improvement in probabilistic forecasting performance."

Do you really honestly believe this? Because this is simply untrue, and the paper's results were basically based on incorrect and deeply flawed benchmarking against another model that does not work - NGBoost.

elephaint commented 3 months ago

> Nothing personal here, I just think the statement that was made is fundamentally flawed.

Thanks for clarifying 👍

> Of course if the data is changing completely it is more difficult, but this is not what the OP wrote about.

The OP asked about observed frequency, to which I replied that entirely matching all the frequencies on the test data nearly never happens in practice. For reference, my words:

> In practice, it is nearly always the case that the observed frequency does not match entirely with the predicted frequency on most datasets (that's because it's a model and thus an approximation of the underlying process).

I still don't understand why you consider this statement false (I think it's correct, even in light of conformal prediction methods / the paper you provided). For example, if a conformal prediction method could guarantee perfect coverage, Fig. 1 of the paper you quoted would show a perfectly straight line for the proposed method and/or the standard conformal prediction method. It doesn't. I'd consider Figure 1 of that paper a nice example of my statement being true: the observed frequency does not entirely match the predicted frequency.

Also from the paper's Introduction:

> We see that over a substantial stretch of time, conformal prediction loses coverage, its intervals decreasing far below the target 90% coverage level, while our proposed method, nonexchangeable conformal prediction, is able to maintain approximately the desired coverage level.

Note the word approximately. This is precisely my point. Approximately the desired coverage level != entirely the desired coverage level.

> The method proposed can't even handle IID data well, which is very clear from the plots. It just strikes me as odd that people would make statements in research papers, only for the reality to turn out to be vastly different.

All of the claims we make in the paper are supported by evidence from the paper, so I'm not sure I understand the objection here. Again - and I feel I need to reiterate this - conformal prediction methods can work better. I never made a secret of this; I literally recommended conformal prediction methods to this user.

> Even then, conformal prediction for nonexchangeable data works very well, including for time series, as I am sure you are familiar with? https://arxiv.org/pdf/2202.13415

> But this exchange is not about time series or non-exchangeable data. It is about methods that don't work, while the authors claim in their papers that they are SOTA and things like that.

> This is from your paper: "We empirically demonstrate the advantages of PGBM compared to existing state-of-the-art methods: (i) PGBM enables probabilistic estimates without compromising on point performance in a single model, (ii) PGBM learns probabilistic estimates via a single model only (and without requiring multi-parameter boosting), and thereby offers a speedup of up to several orders of magnitude over existing state-of-the-art methods on large datasets, and (iii) PGBM achieves accurate probabilistic estimates in tasks with complex differentiable loss functions, such as hierarchical time series problems, where we observed up to 10% improvement in point forecasting performance and up to 300% improvement in probabilistic forecasting performance."

> Do you really honestly believe this? Because this is simply untrue, and the paper's results were basically based on incorrect and deeply flawed benchmarking against another model that does not work - NGBoost.

Yes. We don't claim 'this is the best method for probabilistic estimates in all cases, all the time, every time'. Please read carefully. What we claim is:

  1. PGBM enables probabilistic estimates without compromising on point performance in a single model. This is true and supported by the evidence in the paper.
  2. PGBM learns probabilistic estimates via a single model only (and without requiring multi-parameter boosting) [...]. Again true, as supported by the paper.
  3. PGBM achieves accurate probabilistic estimates in tasks with complex differentiable loss functions [...]. Again true, as supported by the paper.

It seems you are perhaps reading things that we simply don't claim. We're not claiming the best probabilistic estimates ever, in every case, all the time. What we claim is that the method can achieve accurate probabilistic estimates, which is empirically true.

Now, to clarify:

I think nothing in the paper contradicts the statements above, or anything I've said here. But please point out if you think otherwise.

Personally, I feel this entire discussion could have been summarized by the following:

You: You should have compared against conformal prediction methods in your paper, as they offer superior performance for probabilistic regression and they offer mathematical guarantees.

Me: Perhaps - and in hindsight I would probably do so - but at the time we wanted to compare against methods that can produce a full output distribution with a single model only. I wasn't aware of such conformal prediction methods at the time.

Do point me to conformal prediction methods for gradient boosting that can produce a full (continuous) output distribution using a single model only - I think you're more up to date on the conformal prediction literature than I am.

valeman commented 3 months ago

I think this clarifies where we stand. In terms of the complete distribution: Conformal Predictive Distributions provide the full CDF and can be produced on top of any model (including boosted trees) with a few lines of code. The Crepes library has them; all the paper references are in this article: https://valeman.medium.com/how-to-predict-full-probability-distribution-using-machine-learning-conformal-predictive-f8f4d805e420

Thank you for clarifying.

elephaint commented 2 months ago

Thanks for the reference to Crepes. Unfortunately I can't read the Medium article (paywall). Where can I find an example of producing a full distribution with Crepes? (I'm not familiar with the library and couldn't find it easily in the docs.)