dtcenter / MET

Model Evaluation Tools
https://dtcenter.org/community-code/model-evaluation-tools-met
Apache License 2.0
78 stars 24 forks source link

Add the fair CRPS statistic to the ECNT line type in a new CRPS_EMP_FAIR column #2206

Closed j-opatz closed 1 year ago

j-opatz commented 2 years ago

Describe the New Feature

This comes from a discussions comment and seeks to include the fair CRPS verification statistic for ensembles, specifically the ECNT line type. As Roger indicates in a further comment on the Discussions, the fair CRPS " address(es) one particular form of bias in verification metrics for ensemble forecasts, namely that from assessing raw NWP output from a finite-size ensemble". According to the literature linked in the Discussions, this output could be achieved fairly easily by copying the already-calculated CRPS code for ensembles and adjusting it, as illustrated in Equation 4 of Fricker, Ferro, & Stephenson, D.B. (2013). POC for this should be @RogerHar

Acceptance Testing

We should work with the UK Met office to gather data for testing. Describe tests required for new functionality.

Time Estimate

Estimate the amount of work required here. Issues should represent approximately 1 to 3 days of work.

Sub-Issues

Consider breaking the new feature down into sub-issues.

Relevant Deadlines

List relevant project deadlines here or state NONE.

Funding Source

2799991

Define the Metadata

Assignee

Labels

Projects and Milestone

Define Related Issue(s)

Consider the impact to the other METplus components.

New Feature Checklist

See the METplus Workflow for details.

RogerHar commented 2 years ago

Thanks for opening this issue. The CRPS adjustment term is $1/(2m)$ times the mean absolute difference of the ensemble members $\frac{1}{m(m-1)} \sum_{i \ne j} |X_i - X_j|$ (where m is the ensemble size), and I've realised that this mean absolute difference can be useful in its own right as a measure of ensemble spread,[^1] so I'm wondering if this might be worth including as a separate statistic?

[^1]: Hopson, T. M. (2014). Assessing the Ensemble Spread–Error Relationship, Monthly Weather Review, 142(3), 1125-1142. DOI: 10.1175/MWR-D-12-00111.1

sethlinden commented 2 years ago

@TaraJensen I am going to start working on this issue per John's suggestion. What account key should I use for this work?

JohnHalleyGotway commented 2 years ago

Recommend naming the new column(s) as "CRPS_FAIR". Questions for @RogerHar:

  1. Is the Fair CRPS an adjustment to the gaussian CRPS or the empirical CRPS?
  2. Do we only need the forecast Fair CRPS or should we also add Fair Climo CRPS and Fair CRPSS?

Recommend updating the following files:

  1. Update the test_unit header file (internal/test_unit/hdr/met_11_0.hdr) by adding the new column(s) to the end of the ECNT line.
  2. Update the ECNT header in a second file (data/table_files/met_header_columns_V11.0.txt).
  3. Update the Ensemble-Stat documentation in 2 spots:
    • Add the new column(s) to the ECNT table in docs/Users_Guide/ensemble-stat.rst.
    • Add the equation(s) to Appendix C in docs/Users_Guide/appendixC.rst
  4. Likely no need to add new tests since the existing test data will be updated.
  5. Update the vx_util library with the ECNT column(s):
    • In src/basic/vx_util/stat_column_defs.h update the ecnt_columns array.
  6. Update the vx_stat_out library with the new ECNT column(s).
    • In src/libcode/vx_stat_out/stat_columns.cc update the write_ecnt_cols() function.
  7. Update the Stat-Analysis application to handle the new column(s).
    • In src/tools/core/stat_analysis/parse_stat_line.h, update the ECNTData struct by adding variables to store data from the new column(s).
    • In src/tools/core/stat_analysis/parse_stat_line.cc, update the parse_ecnt_line() function to read data from the new column(s).
    • In src/tools/core/stat_analysis/aggr_stat_line.cc, update the aggr_ecnt_lines() function to (1) stash the new data in the EnsemblePairData class, and (2) compute the aggregated value(s) just like the other CRPS stats are handled.
  8. Update the vx_statistics library to compute and store the Fair CRPS statistic.
    • In src/libcode/vx_statistics/pair_data_ensemble.h update the PairDataEnsemble class with new NumArray object(s) for this data.
    • In src/libcode/vx_statistics/pair_data_ensemble.cc search for instances of crps_gaus or crps_emp and add similar logic for crps_fair. To actually compute the Fair CRPS, it's an adjustment of the crps_gaus or crps_emp value. Recommend adding new NumArray::mean_abs_diff() function and use it in the compute_pair_vals() function like this: crps_fair_na.add(crps(_emp or _gaus?) - ens_na.mean_abs_diff();) Psudo-code for mean_abs_diff (not sure about the looping logic here):
      for(i=0,sum=0.0;i<n;i++) {
      for(j=i+1;j<n;j++) {
      sum += abs(e[i]-e[j])
      }
      }
      return(sum / (n*(n-1)))
    • Update the ECNTInfo class in src/libcode/vx_statistics/ens_stats.h by adding new variable(s) for these new stat(s).
    • Update src/libcode/vx_statistics/ens_stats.cc by searching for instances of crps_gaus or crps_emp and adding similar logic for crps_fair.
RogerHar commented 2 years ago

Recommend naming the new column(s) as "CRPS_FAIR". Questions for @RogerHar:

  1. Is the Fair CRPS an adjustment to the gaussian CRPS or the empirical CRPS?

It's an adjustment to the empirical CRPS.

  1. Do we only need the forecast Fair CRPS or should we also add Fair Climo CRPS and Fair CRPSS?

Good question. I don't think a separate fair climatological CRPS should be necessary, I think it should be ok to construct a Fair CRPSS using an existing climatological reference forecast. It's less obvious which one.... I apologise that I hadn't previously realised the distinction between CRPSCL and CRPSCL_EMP used in computing CRPSS and CRPSS_EMP, so I've just been reading Appendix C of the User's Guide and the Climatology data section of the Ensemble-Stat User's Guide.

The point of the Fair CRPS is to remove the dependence on ensemble size, so I don't think CRPSCL_EMP would be appropriate as a reference as that depends on the number of climatology values used in the calculation as determined by the cdf_bins setting, which is similar to an ensemble size. As MET assumes that the climatology follows a normal distribution, to define a fair CRPSS I think I'd always want to use the climatological CRPS obtained by plugging the climatological mean and sd into the Gneiting formula for the CRPS of a normal distribution, which I believe is CRPSCL.

I understand from #1451 (which I hadn't looked at before either) that CRPS_EMP was mainly introduced to reproduce the skill score produced by VSDB? As the fair CRPS is completely new that's not an issue here.

You could of course leave out the fair CRPSS and just leave it up to the user if they wish to compute a skill score using either of the climatological reference scores, as both are included in the ECNT output line type. But I guess that might be a bit confusing.

sethlinden commented 2 years ago

@RogerHar thanks for the detailed response. I think for now we will just create the CRPS_EMP_FAIR variable and this is the development path I was going down anyway. We can always go back and do this for CRPSCL if necessary.

sethlinden commented 2 years ago

Based on conversion with @j-opatz and @JohnHalleyGotway, made an update to the code such that: when all the forecast/observed precip is 0, we were getting CRPS_EMP and CRPS_EMP_FAIR computed as 1.3563e-05 and 6.7817e-06. Those numbers are very close to 0, but not actually 0 because of machine precision. So we are setting the values to 0:

// Compute empirical CRPS scores crps_emp = pd.crps_emp_na.wmean(pd.wgt_na); if(is_eq(crps_emp, 0.0)) crps_emp = 0.0; crps_emp_fair = pd.crps_emp_fair_na.wmean(pd.wgt_na); if(is_eq(crps_emp_fair, 0.0)) crps_emp_fair = 0.0;

sethlinden commented 1 year ago

We discovered a small mathematical bug in the calculation for Mean Absolute Difference (SPREAD_MD) which is also used to calculate CRPS_EMP_FAIR. In the spread-md calculation we were dividing by (n*(n-1)), where n is the total number of ensemble members. But since the loop over the sum of the differences already takes into account when i !=j, we really needed to divide by the number of "differences", the count of the diff pairs.

So instead of mad = sum / (n*(n-1));, we need to do: mad = sum / count;