NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a user, I would like sample counts that are inline to the statistics #211

Open epag opened 2 months ago

epag commented 2 months ago

Author Name: James (James) Original Redmine Issue: 65101, https://vlab.noaa.gov/redmine/issues/65101 Original Date: 2019-06-18


Expected behavior:

Given a statistical output from the WRES, when I look at that output, then I would like to see the sample size inline to each atomic statistic in that output.

"Atomic" means the most elementary statistic. For a score, the sample size is the one associated with the score that summarizes a pool. For a diagram, it is the sample size associated with each statistic that summarizes a sub-pool of the pool. For the rank histogram, for example, it is the number of samples in each bin. For the reliability diagram, it is the sharpness diagram, which already exists.
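As a rough sketch of what "inline" and "atomic" could look like, the Python below attaches a sample size to every elementary statistic. All names here (`Score`, `DiagramBin`, `mean_error`, `rank_histogram`) are hypothetical and are not WRES classes, and the rank histogram ignores ties purely to keep the illustration short.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Score:
    """Hypothetical score: one value summarizing a pool, with its sample size inline."""
    name: str
    value: float
    sample_size: int


@dataclass(frozen=True)
class DiagramBin:
    """Hypothetical diagram component: one statistic for a sub-pool, with its sample size."""
    label: str
    value: float
    sample_size: int


def mean_error(pairs: Sequence[tuple]) -> Score:
    """Mean error over (observed, predicted) pairs; the sample size is the pool size."""
    errors = [predicted - observed for observed, predicted in pairs]
    return Score("MEAN ERROR", sum(errors) / len(errors), len(errors))


def rank_histogram(pairs: Sequence[tuple]) -> List[DiagramBin]:
    """Rank histogram over (observed, ensemble_members) pairs, ignoring ties.

    Each bin carries a relative frequency and, inline, the count of samples in that bin.
    """
    n_members = len(pairs[0][1])
    counts = [0] * (n_members + 1)
    for observed, members in pairs:
        # The bin index is the number of ensemble members below the observation.
        rank = sum(1 for member in members if member < observed)
        counts[rank] += 1
    total = len(pairs)
    return [DiagramBin(f"rank {i + 1}", c / total, c) for i, c in enumerate(counts)]


# Made-up data, for illustration only.
members = [1.0, 2.0, 3.0]
ensemble_pairs = [(0.5, members), (2.5, members), (3.5, members), (2.2, members)]
print([b.sample_size for b in rank_histogram(ensemble_pairs)])  # [1, 0, 2, 1]
```

In this layout a score has exactly one sample size, while a diagram has one per bin, which matches the distinction drawn above between a pool and its sub-pools.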

Actual behavior:

Currently, the sample size is a separate metric. This is useful and should remain. However, it is also useful, both for users and developers, to have information on the sample sizes inline to the statistical outputs and for the most atomic type of statistic within that output.

This will add a small amount of bloat to the output from the WRES (small relative to all the other metadata, such as time windows), but I think it is useful, both in terms of location (closer to where it is needed) and specificity (more atomic).


Redmine related issue(s): 85491, 88213, 91948, 97399


epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2019-06-18T12:30:33Z


See, for example, #65049-41.

epag commented 2 months ago

Original Redmine Comment Author Name: Chris (Chris) Original Date: 2019-06-18T13:35:08Z


Just out of curiosity, didn't the application do this at one point? It HAD to have been a while back, but it sounds familiar. We already have it in the NetCDF output, though we might want to look at making it a perpetual NetCDF variable since its current form makes the assumption that the number is the same across all features with a given statistic.
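The assumption described above can be illustrated without any NetCDF library. The sketch below is not the WRES output layout: the function name and the feature names are hypothetical. It only shows when a single shared sample size per statistic is a faithful summary and when a per-feature value would be needed.

```python
from typing import Dict, Optional


def shared_sample_size(per_feature: Dict[str, int]) -> Optional[int]:
    """Return the single sample size common to every feature, or None if they differ."""
    distinct = set(per_feature.values())
    return distinct.pop() if len(distinct) == 1 else None


# Hypothetical feature names and counts, for illustration only.
uniform = {"FEATURE_A": 120, "FEATURE_B": 120}
uneven = {"FEATURE_A": 120, "FEATURE_B": 97}

print(shared_sample_size(uniform))  # 120
print(shared_sample_size(uneven))   # None
```

When the counts differ between features, as in the second case, one shared number cannot represent them, so the sample size would have to be stored per feature alongside the statistic.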

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2019-06-18T13:54:21Z


No, but you're right in thinking that the most aggregate sample size was part of the metadata at one point (good memory!).

In terms of implementation, I think we'd make it a core component of the statistics, rather than metadata, because that would allow for the sample size to be stored inline to the statistics and at the most atomic level. TBD though.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2019-06-18T19:01:29Z


Note #65085-41. In light of that ticket and post, and contrary to the OP, which states (with respect to the `SampleSize` metric):

"This is useful and should remain."

It probably should not remain. It is better to include the sample sizes inline to all of the metrics, rather than include the sample size as a separate metric.
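One way to see why a separate metric might become redundant: where the inline counts partition the pool, the pool-level sample size can be recovered by summing them. The sketch below assumes each sample falls into exactly one bin, as in a rank histogram; that may not hold for every diagram. The tuple layout and the function name are hypothetical.

```python
from typing import Iterable, Tuple

# Each bin is (label, relative frequency, sample count); the layout is hypothetical.
Bin = Tuple[str, float, int]


def pool_sample_size(bins: Iterable[Bin]) -> int:
    """Sum the inline bin counts to recover the sample size of the whole pool."""
    return sum(count for _label, _frequency, count in bins)


histogram = [("rank 1", 0.25, 1), ("rank 2", 0.0, 0), ("rank 3", 0.5, 2), ("rank 4", 0.25, 1)]
print(pool_sample_size(histogram))  # 4
```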

epag commented 2 months ago

Original Redmine Comment Author Name: alexander.maestre (alexander.maestre) Original Date: 2021-03-06T06:18:23Z


James - Thank you for pointing me to this ticket. I will continue exploring the threshold case here, using the csv2 file. I can test it either way: as a separate metric or inline with the statistic.

Regards, Alex