NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

Basic support with stand-alone evaluation #65

Open t3dixon opened 3 months ago

t3dixon commented 3 months ago

Hi there!

I'm trying to mimic a typical HEFS evaluation, but some of my outputs seem to lack resolution. For example, the reliability diagram, rank histogram, and ROC diagram plots only show two or maybe three points. Is there a way to increase the number of bins/points for these outputs?

Also, the cross-pair functionality does not seem to work when I run it. Can you help me better understand what this is doing? Pairing the forecasts with the observations appears to be working.

Note that I've commented out the baseline forecast information (and associated skill scores) because the ESP data were too large to attach here. Also, I've lowered the minimum_sample_size setting to 1 for testing, since I'm only using one year of forecast data.

Thank you!

observed:
  sources: HOPC1.QME.csv
  variable: QME
  type: observations
predicted:
  sources: HOPC1.HEFS.tgz
  variable: QINE
  type: ensemble forecasts
#baseline:
#  sources: HOPC1.ESP.tgz
#  variable: QINE
#  type: ensemble forecasts
#  separate_metrics: true
#features:
#  - {observed: HOPC1, predicted: HOPC1, baseline: HOPC1}
unit: CFS
time_scale:
  function: mean
  period: 24
  unit: hours
duration_format: days
lead_times:
  minimum: 0
  maximum: 360
  unit: hours
lead_time_pools:
  period: 24
  frequency: 24
  unit: hours
pair_frequency:
  period: 24
  unit: hours
#cross_pair: exact
probability_thresholds:
  values: [0.1,0.5,0.9,0.95]
  operator: greater equal
minimum_sample_size: 1
metrics:
  - sample size
  - mean error
  - mean absolute error
  - root mean square error
  - pearson correlation coefficient
  - bias fraction
  - brier score
  - continuous ranked probability score
  - relative operating characteristic score
#  - brier skill score
#  - mean square error skill score
#  - continuous ranked probability skill score
  - reliability diagram
  - relative operating characteristic diagram
  - quantile quantile diagram
  - name: rank histogram
    probability_thresholds:
      values: [0.1,0.5,0.9,0.95]
      apply_to: predicted
decimal_format: '#0.00000'
output_formats:
  - csv2
  - pairs
  - png

HOPC1.QME.csv HOPC1.HEFS.tgz wres-test-outputs.zip

james-d-brown commented 3 months ago

Hey Taylor,

Let me try to reproduce your results locally so that I can provide some insight. A few thoughts in principle, though:

My guess is that there are some peculiarities in your dataset that are leading to these odd results, such as all ensemble members consistently having the same value. But I will try to reproduce locally as a starting point...

Cheers,

James

james-d-brown commented 3 months ago

Reproduced.

The first thing I notice is these two warnings:

    - The evaluation declares a 'time_scale', but the 'time_scale' associated with the 'observed' dataset is undefined. Unless the data source for the 'observed' dataset clarifies its own time scale, it is assumed that the dataset has the same time scale as the evaluation 'time_scale' and no rescaling will be performed. If this is incorrect or you are unsure, it is best to declare the 'time_scale' of the 'observed' dataset.
    - The evaluation declares a 'time_scale', but the 'time_scale' associated with the 'predicted' dataset is undefined. Unless the data source for the 'predicted' dataset clarifies its own time scale, it is assumed that the dataset has the same time scale as the evaluation 'time_scale' and no rescaling will be performed. If this is incorrect or you are unsure, it is best to declare the 'time_scale' of the 'predicted' dataset.

In general, it is best either to declare the time scale in-band with the data (i.e., to use a format that supports this, such as the CSV format) or to declare the time scale of each dataset within the declaration itself. In the absence of this, an assumption will be made, as indicated in the warnings.

Such a declaration might look like this (adjusted to whatever the data actually represent):

observed:
  sources: HOPC1.QME.csv
  variable: QME
  type: observations
  time_scale:
    function: mean
    period: 24
    unit: hours
predicted:
  sources: HOPC1.HEFS.tgz
  variable: QINE
  type: ensemble forecasts
  time_scale:
    function: mean
    period: 1
    unit: hours
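
Assuming those example values reflect what the data actually represent, declaring the dataset time scales this way should prompt the software to rescale the hourly QINE forecasts to 24-hour means before pairing them with the daily QME observations, and the rescaling warnings should disappear.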

This is followed by warnings along these lines for the calculation of every single pool:

2024-08-19T16:51:52.674+0000 WARN PoolReporter [1/15] Completed statistics for a pool in feature group 'HOPC1-HOPC1'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: -1000000000-01-01T00:00:00Z, Latest valid time: +1000000000-12-31T23:59:59.999999999Z, Earliest lead duration: PT0S, Latest lead duration: PT24H ). However, encountered 730 evaluation status warning(s) when creating the pool. Of these warnings, 730 originated from 'RESCALING'. An example warning follows for each evaluation stage that produced one or more warnings. To review the individual warnings, turn on debug logging. Example warnings: {RESCALING=EvaluationStatusMessage[LEVEL=WARN,STAGE=RESCALING,MESSAGE=While inspecting a time-series dataset with a 'predicted' orientation, failed to discover the time scale of the time-series data. However, the evaluation requires a time scale of [PT24H,MEAN]. Assuming that the time scale of the data matches the evaluation time scale and that no rescaling is required. If that is incorrect, please clarify the time scale of the time-series data. The time-series metadata is: TimeSeriesMetadata[timeScale=<null>,referenceTimes={},variableName=QME,feature=Feature[name=HOPC1,description=,srid=0,wkt=],unit=CFSD]]}.

Next, and I think this is the crux of the problem, I see only one ensemble member in the paired data. Why? Well, it seems that the predictions use the column keyword ensemblemember, but the expected keyword is ensemblemember_id:

https://github.com/NOAA-OWP/wres/wiki/Format-Requirements-for-CSV-Files

In other words, the software is interpreting every forecast as a single-valued forecast because it is ignoring the ensemble information. The software is lenient about columns whose names do not match the expected keywords, but it will not use the information in those columns.

In short, I would start by fixing the keyword in the column header (ensemblemember --> ensemblemember_id), which should reintroduce the ensemble information.
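
As a minimal sketch, the fix is just that one keyword in the header row; the other columns are elided below, so see the wiki page above for the full set of expected column names:

    ...,ensemblemember,...        (current header: not a recognized keyword, so the column is ignored)
    ...,ensemblemember_id,...     (corrected header: the ensemble information is read)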

Let me know if something above doesn't make sense.

Cheers,

James