dtcenter / MET

Model Evaluation Tools
https://dtcenter.org/community-code/model-evaluation-tools-met
Apache License 2.0

Add the Hersbach CRPS and CRPSS statistics to the ECNT line type. #1450

Closed: JohnHalleyGotway closed this issue 3 years ago

JohnHalleyGotway commented 4 years ago

Describe the New Feature

There are two methods for computing the continuous ranked probability score (CRPS) and corresponding skill score (CRPSS), the Gneiting and Hersbach methods (see references below). The ECNT line type (written by Ensemble-Stat for ensembles and Point-Stat for HiRA) includes CRPS and CRPSS as of met-9.1, computed using the Gneiting method.

The US Air Force has been using the Hersbach method and the statisticians on our team have confirmed that it's a worthwhile statistic to compute. It is also the method used to compute CRPS for the WMO-CBS scores.

This task is to add support to MET for the Hersbach version of CRPS and CRPSS. Many questions will arise in its computation:

1. Is it suitable to compute for HiRA in Point-Stat? [Answer: Based on Marion, it's necessary for HiRA to match their statistics.]
2. Can it be aggregated over multiple cases in Stat-Analysis and METviewer? How should that be done?
3. What should the new output columns be named?
4. How should this work be funded? [Answer: Funding it through the Met Office; key listed below.]

References: Gneiting, T., A. Westveld, A. Raftery, and T. Goldman, 2004: Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation. Technical Report no. 449, Department of Statistics, University of Washington. [Available online at http://www.stat.washington.edu/www/research/reports/ ]

Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Wea. Forecasting, 15, 559-570.

Acceptance Testing

List input data types and sources. Describe tests required for new functionality.

Time Estimate

2 days.

Sub-Issues

Consider breaking the new feature down into sub-issues. No sub-issues required. But this is related to MET #1451.

Relevant Deadlines

Must be in MET-10.0.0

Funding Source

2799991 - Met Office

Define the Metadata

Assignee

Labels

Projects and Milestone

Define Related Issue(s)

Consider the impact to the other METplus components.

New Feature Checklist

See the METplus Workflow for details.

JohnHalleyGotway commented 4 years ago

Relevant email from Eric G about this:

I haven't had a chance to read through the references yet, but here is what Chris Ferro said, which makes sense to me (should use the regular CRPS instead of the normal approximation unless you have a small sample size, which you probably don't).

"Usual practice would be to use the crps for the ensemble edf (or the 'fair' version of this) rather than using the crps for a fitted normal distribution. If the underlying distribution were non-normal then using a fitted normal might give misleading results. Your colleagues might like to know that the crps can be calculated for quite a lot of different types of distribution: see https://cran.r-project.org/web/packages/scoringRules/vignettes/article.pdf

I suppose that if the ensemble is small then it might be the case that a score based on just the ensemble mean and variance (like the crps for a fitted normal) is less noisy than a score based on the ensemble edf. The most common score based on the mean and variance is the Dawid-Sebastiani score (which is equivalent to the log score for a normal distribution) rather than the crps for normal. If the mean and variance are estimated from an ensemble then the DS score is quite sensitive to estimation error in the variance, so noise is a problem. I don't know of any work looking at how sensitive the crps for a fitted normal is to estimation error, though."

RogerHar commented 3 years ago

In response to your question 2 above:

Yes, the ensemble CRPS calculated using Hersbach's method can be aggregated over multiple cases. However, it can also be decomposed into 'reliability' and 'potential' components, similar to the decomposition of the Brier score into reliability, resolution and uncertainty, and we have found this decomposition useful. To do this it's necessary to store the 'alpha' and 'beta' values rather than (or as well as) the CRPS values themselves - see pages 563-564 of Hersbach (2000). We therefore store the alpha and beta values for each case (ensemble forecast - observation pair) in our database. The CRPS and its reliability and potential components can then be calculated from these over a particular set of cases, i.e. time period plus area / set of stations (Hersbach eqs 30-37).
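To make the bookkeeping concrete, here is a rough sketch (my own, not VER's actual code) of how stored case-mean alpha/beta values can be turned into the aggregated CRPS and its reliability and potential components, following the decomposition cited above (Hersbach 2000, eqs. 30-37). For simplicity it applies the interior-bin formula to the outlier bins as well, whereas Hersbach handles bins i = 0 and i = N via the outlier frequencies obar_0 and obar_N:

```python
def crps_from_alpha_beta(alpha_bar, beta_bar):
    """Aggregated CRPS, reliability, and potential from mean alpha/beta values.

    alpha_bar, beta_bar: sequences of length N+1 (N = ensemble size), each
    entry being the mean over all cases of the per-case alpha_i / beta_i.
    Simplified sketch: outlier bins are treated like interior bins.
    """
    n = len(alpha_bar) - 1
    crps = rel = pot = 0.0
    for i, (a, b) in enumerate(zip(alpha_bar, beta_bar)):
        p = i / n                       # cumulative probability p_i = i/N
        g = a + b                       # mean bin width g_i
        o = b / g if g > 0 else 0.0     # mean observed frequency o_i
        crps += a * p ** 2 + b * (1.0 - p) ** 2
        rel += g * (o - p) ** 2
        pot += g * o * (1.0 - o)
    return crps, rel, pot
```

By construction CRPS = reliability + potential here; the identity holds bin by bin.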

JohnHalleyGotway commented 3 years ago

For #1451 and #1450, CRPS and CRPSS are existing columns in met-9.1. Propose adding several additional columns, as described below.

Is HB a good abbreviation for Hersbach?

For Gneiting Method, add 1 new CRPSCL column between the existing CRPS and CRPSS columns: CRPS CRPSCL CRPSS

For Hersbach Method, add 7 new columns for: CRPS_HB, CRPS_HB_REL (reliability), CRPS_HB_POT (potential), CRPS_HB_ALPHA, CRPS_HB_BETA, CRPSCL_HB, CRPSS_HB

Will also need to enhance Stat-Analysis to aggregate CRPS_HB over multiple cases using the CRPS_HB_ALPHA and CRPS_HB_BETA values.

Will need to define issues for METviewer and METdatadb to handle new columns that are NOT at the end of the ECNT line type.

RogerHar commented 3 years ago

On column naming: The distinction between the 'Hersbach' and 'Gneiting' methods is that the 'Gneiting' method fits a normal distribution whereas the 'Hersbach' method uses the empirical distribution function of the ensemble without assuming a parametric form. However, there are two other formulae due to other authors that are equivalent to the 'Hersbach' CRPS - see p11-12 of this report by Thorarinsdottir & Schuhen (a preprint of their chapter on verification in a 2018 book on statistical postprocessing). For that reason I think it may be preferable to label with some abbreviation for 'Empirical (distribution function)', e.g. EMP or EDF, rather than HB, giving e.g. CRPS_EMP, CRPSS_EMP, CRPSCL_EMP. (If we could start from scratch it might be better if the 'Gneiting' CRPS were labelled CRPS_NORM to stress that it assumes a normal distribution, and 'CRPS' were the version that doesn't assume a distribution, but I guess it's a bit late for that...)

The CRPS decomposition into reliability and potential, and the alpha and beta values, are due to Hersbach alone, though, so including HB in their names seems more appropriate. Or would simply CRPS_REL, CRPS_POT, CRPS_ALPHA, CRPS_BETA work?

All this is a detail though, it's more important to us to get them in than worry about what they're called!

JohnHalleyGotway commented 3 years ago

Roger, this is excellent feedback. Thanks! I'll proceed with the column names you recommended: CRPS_EMP, CRPSCL_EMP, CRPSS_EMP, CRPS_REL, CRPS_POT, CRPS_ALPHA, CRPS_BETA. After talking with the folks who load this output into the METviewer database downstream, it sounds like I'll need to add all the new columns to the END of the ECNT line, instead of in the middle where they more naturally/logically fit. The loader could be redesigned to more easily handle additions in the middle of lines, but that's a large change to the logic and not realistic prior to the next release. If you're willing/able to write up a paragraph or two about the Gneiting vs. Hersbach CRPS methods, I'd be happy to include it in the MET User's Guide.

JohnHalleyGotway commented 3 years ago

I found some useful code in the properscoring python package for testing and comparison: https://pypi.org/project/properscoring/

I added some extra print statements to MET to make the comparison easier. Here's log output for a single point:

DEBUG 2: JHG: CRPS = 1.46941, MEAN = 3.53333, STDEV = 2.39221
DEBUG 2: e=np.array([3.90000, 7.20000, 4.60000, 0.10000, 2.10000, 3.30000]); print("mean = ", e.mean()); print("stdev = ", e.std(ddof=1)); print("crps_norm = ", ps.crps_gaussian(1.10000, mu=e.mean(), sig=e.std(ddof=1))); print("crps_emp =", ps.crps_ensemble(1.10000, e));

Running through python we get:

import numpy as np
import properscoring as ps
from scipy.stats import norm
e = np.array([3.90000, 7.20000, 4.60000, 0.10000, 2.10000, 3.30000])
print("mean = ", e.mean())
print("stdev = ", e.std(ddof=1))
print("crps_norm = ", ps.crps_gaussian(1.10000, mu=e.mean(), sig=e.std(ddof=1)))
print("crps_emp =", ps.crps_ensemble(1.10000, e))
mean =  3.533333333333333
stdev =  2.39220957833269
crps_norm =  1.469410821589791
crps_emp = 1.555555555555555

So the CRPS (normal), MEAN, and STDEV values match between the two. Now, I just need to enhance MET to compute the crps_emp score.
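For reference, the closed-form Gaussian CRPS used by properscoring's crps_gaussian (i.e., the Gneiting-style normal fit) can be sketched with just the standard library; the function name here is mine, not MET's:

```python
import math

def crps_gaussian_closed_form(obs, mu, sigma):
    """CRPS of a normal forecast N(mu, sigma^2) for a scalar observation.

    Closed form: sigma * [z*(2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi)],
    where z = (obs - mu)/sigma and Phi/phi are the standard normal CDF/PDF.
    """
    z = (obs - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # phi(z)
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf
                    - 1.0 / math.sqrt(math.pi))
```

Plugging in the values from the log output above (obs = 1.1, mu = 3.53333, sigma = 2.39221) reproduces CRPS ≈ 1.4694.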

JohnHalleyGotway commented 3 years ago

@j-opatz and @RogerHar, I made some progress in coding up Hersbach CRPS.

I lifted some python code to compute the empirical CRPS for a single obs and group of ensemble members: https://github.com/dtcenter/MET/blob/8b7ab29455ac67fd9b872041f331df8f640d8c1c/met/src/libcode/vx_statistics/pair_data_ensemble.cc#L1785 And I'm getting the same results as python. So that's good.
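For the record, the empirical (ensemble) CRPS for a single observation can be sketched in a few lines, equivalent to ps.crps_ensemble (the function name is mine, not the code linked above):

```python
def crps_empirical(obs, ens):
    """Empirical CRPS of an ensemble for one scalar observation.

    CRPS = mean_i |x_i - obs| - (1/2) * mean_{i,j} |x_i - x_j|
    """
    n = len(ens)
    term1 = sum(abs(x - obs) for x in ens) / n
    term2 = sum(abs(a - b) for a in ens for b in ens) / (2.0 * n * n)
    return term1 - term2
```

With the ensemble from the log output above, crps_empirical(1.1, [3.9, 7.2, 4.6, 0.1, 2.1, 3.3]) gives 1.5555..., matching crps_emp from properscoring.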

But this method does NOT decompose into reliability and potential. I did find that code written up in the "verification" package in R: https://rdrr.io/cran/verification/man/crpsDecompostion.html

I'm trying to work through that code but it's confusing. I'm trying to determine exactly what I need to write to enable the empirical CRPS to be aggregated across multiple cases. I was hoping that writing a single alpha and beta value for each verification task would suffice. But looking at the R code, I suspect I'd need to write separate alpha and beta values for each ensemble member. So for a 30-member ensemble, instead of 1 alpha and 1 beta column, we'd have 30 of each!?

@RogerHar, can you please look at what the MetOffice stores for alpha and beta? Is it one alpha and beta value per case... or is it one alpha and beta values for each ensemble member and case?

RogerHar commented 3 years ago

Sorry @JohnHalleyGotway, you're essentially correct I'm afraid. The alphas and betas have N + 1 separate values for an N-member ensemble. To compute the CRPS reliability and potential you also need the frequencies of the two outliers obar_0 and obar_N defined in Hersbach equation 33. VER has a separate CRPS table type to hold these, similar to that for rank histograms. So as MET has an output line type for RHIST, you may need a separate output line type for these CRPS coefficients. (Alternatively it might be possible to add alpha and beta to the RHIST line type?? I've no idea of the repercussions that would have, but in principle it's kind of neat as there's an alpha and a beta for each possible rank, and the obar_0 and obar_N are the first and last counts of observation ranks so already in RHIST).
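To illustrate the N + 1 bookkeeping, here is a hedged sketch (my own, not VER's or MET's code) of the per-case alpha/beta values for one observation and an N-member ensemble, following Hersbach (2000). Summing alpha_i * p_i^2 + beta_i * (1 - p_i)^2 with p_i = i/N recovers the empirical CRPS for that case:

```python
def hersbach_alpha_beta(obs, ens):
    """Per-case alpha/beta values (Hersbach 2000): N+1 entries for N members."""
    x = sorted(ens)
    n = len(x)
    alpha = [0.0] * (n + 1)
    beta = [0.0] * (n + 1)
    if obs < x[0]:                # outlier bin i = 0: obs below all members
        beta[0] = x[0] - obs
    if obs > x[-1]:               # outlier bin i = N: obs above all members
        alpha[n] = obs - x[-1]
    for i in range(1, n):         # interior bins between consecutive members
        lo, hi = x[i - 1], x[i]
        alpha[i] = min(max(min(obs, hi) - lo, 0.0), hi - lo)
        beta[i] = min(max(hi - max(obs, lo), 0.0), hi - lo)
    return alpha, beta

def crps_from_one_case(obs, ens):
    """Recover the single-case empirical CRPS from its alpha/beta values."""
    alpha, beta = hersbach_alpha_beta(obs, ens)
    n = len(ens)
    return sum(a * (i / n) ** 2 + b * (1.0 - i / n) ** 2
               for i, (a, b) in enumerate(zip(alpha, beta)))
```

For the example point earlier in this thread (obs = 1.1, six members), this reproduces the empirical CRPS of 1.5555...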

My apologies, I'm really not sure how I managed to overlook this before.

JohnHalleyGotway commented 3 years ago

Roger,

Thanks for the followup. I've been testing 2 python packages (properscoring and ensverif) and 1 R package (verification) for computing the empirical CRPS. Perhaps I'm being naive, but this situation seems much simpler than we've made it.

When computing the empirical CRPS over many points for one day, the CRPS for a group of points is just the mean of the CRPS values for the individual points. So, just like we do for the existing normal CRPS, it stands to reason that we should aggregate the CRPS over many days by computing a weighted mean of the daily CRPS values, where the weight is determined by the number of pairs each day.

And we also have: CRPS_EMP = CRPS_REL + CRPS_POT

Since the mean of a sum is the same as the sum of the means, we should be able to aggregate CRPS_REL and CRPS_POT in exactly the same way.

So while we do need to compute alpha and beta to decompose CRPS each day, we DO NOT need to actually store the full set of alpha and beta values. Instead, we can aggregate CRPS_EMP, CRPS_REL, and CRPS_POT as weighted means of the daily scores. And that requires much less storage.

That's how I'm proceeding at this point. But please let me know if you see a flaw in this logic that I've overlooked.

...

[Update] I did more testing and see that my assumptions were not correct! While we can aggregate the empirical CRPS as a weighted average, we can NOT aggregate reliability and potential in the same way.

It is true that mean(CRPS_EMP) = mean(CRPS_REL) + mean(CRPS_POT). But mean(CRPS_REL) != aggregated reliability, and mean(CRPS_POT) != aggregated potential.

So aggregating these scores correctly would require a lot more storage!
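A tiny worked example (my own construction, with the outlier bins simplified to the interior-bin formula) shows why. Take two cases of a 2-member ensemble [0, 1], with obs = 0.5 and obs = 2; the per-case alpha/beta values below follow from Hersbach's definitions. Averaging the per-case reliabilities gives a different answer than computing reliability from the case-averaged alpha/beta values, even though CRPS itself aggregates linearly:

```python
def reliability(alpha, beta):
    """Hersbach-style reliability from (possibly case-averaged) alpha/beta."""
    n = len(alpha) - 1
    rel = 0.0
    for i, (a, b) in enumerate(zip(alpha, beta)):
        g = a + b
        if g > 0:
            rel += g * (b / g - i / n) ** 2
    return rel

# Two-member ensemble [0, 1]; bins i = 0, 1, 2; p_i = 0, 0.5, 1.
# Case 1: obs = 0.5 falls inside the ensemble -> alpha_1 = beta_1 = 0.5.
# Case 2: obs = 2 falls above -> alpha_1 = 1, alpha_2 = 1 (outlier bin).
a1, b1 = [0.0, 0.5, 0.0], [0.0, 0.5, 0.0]
a2, b2 = [0.0, 1.0, 1.0], [0.0, 0.0, 0.0]

mean_of_rels = 0.5 * (reliability(a1, b1) + reliability(a2, b2))
rel_of_means = reliability([0.0, 0.75, 0.5], [0.0, 0.25, 0.0])
```

Here mean_of_rels = 0.625 while rel_of_means = 0.5625: the nonlinearity of o_i = beta_i / (alpha_i + beta_i) is what breaks the naive averaging.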

JohnHalleyGotway commented 3 years ago

@RogerHar, @j-opatz and I met to discuss these issues and potential options. Wanted to put this on the radar of @TaraJensen as well.

The short question is this: Is it worth writing CRPS reliability and potential if we don't provide a way to aggregate across multiple cases?

The longer set of details are:

Option (1) Add a new CRPS line type: Writing 2*(N_ENS+1) extra alpha/beta columns to the existing ECNT line type is not feasible. If we need to do that, we would define a new empirical CRPS line type instead. But of course that would come with many impacts to the processing and aggregation logic, documentation, testing, and loading/processing by METviewer.

Option (2) Write reliability and potential, but don't aggregate it: We could enhance Ensemble-Stat to decompose the empirical CRPS into reliability and potential and write them for each case. But when running Stat-Analysis to aggregate across cases, just report reliability = potential = NA. Also recommend that users NOT plot the mean of reliability and potential either!

Option (3) Do not write reliability and potential: Of course this is the simplest option. If we aren't providing a robust way to aggregate these stats, don't write them at all.

Option (1) is pretty involved and time consuming. When weighed against other MetOffice priorities, is this what we should spend time/funding on?

Option (2) would be fine, but might frustrate users if we fail to provide a good aggregation method.

Perhaps option (3) is fine because while having the CRPS decomposition is nice, maybe it isn't mandatory?

@RogerHar, what's your advice?

RogerHar commented 3 years ago

I'd say that Option (2) is not worth pursuing as the reliability and potential aren't very useful unless they're aggregated across many cases (you can't meaningfully judge whether a forecast is reliable based on a single case).

Option (3) is certainly useful to us, as we don't use the decomposition with HiRA (as far as I'm aware). Several R and python packages compute the ensemble CRPS but not the decomposition.

Option (1) to include the decomposition will be needed eventually before we retire VER if we're to replicate all its functionality. We have found the CRPS decomposition to be useful in the detailed evaluation of our post-processing system.

So I think Option (3) is the initial priority and the CRPS decomposition could be left as a separate lower-priority issue. (I wouldn't like to put a date / milestone on that myself though).