NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a user, I want to reconsider the default pooling windows for forecast datasets #132

Open epag opened 1 month ago

epag commented 1 month ago

Author Name: James (James) Original Redmine Issue: 120817, https://vlab.noaa.gov/redmine/issues/120817 Original Date: 2023-09-22


Given an evaluation that contains forecasts
When I consider the default pooling to adopt in the absence of a pooling declaration (@lead_time_pools@, @reference_date_pools@ or @valid_date_pools@)
Then I want to adopt one pool for each forecast lead time
So that the default is closer to the typical user expectation for pooling forecast data, which is one pool per lead duration and not one big pool
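
To make the proposed default concrete, here is a minimal Python sketch of "one pool per lead time". It is not WRES code: the function name @default_lead_time_pools@ and the representation of a pool as a (lower, upper) pair of durations are assumptions made purely for illustration.

```python
from datetime import timedelta


def default_lead_time_pools(lead_times):
    """Return one (lower, upper) pool per distinct lead time, in ascending order.

    Hypothetical helper: a pool is modelled as a pair of durations whose
    bounds coincide, so that it selects exactly one lead duration.
    """
    unique = sorted(set(lead_times))
    return [(lead, lead) for lead in unique]


# Lead times as they might appear across several forecasts (with repeats).
leads = [timedelta(hours=h) for h in (6, 12, 6, 18, 12)]
pools = default_lead_time_pools(leads)
```

With the illustrative lead times above, this gives three pools (6, 12 and 18 hours) rather than one pool spanning all of them.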

epag commented 1 month ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-09-22T11:26:37Z


I have noticed some confusion about this from several users, notably in #112193 and, offline and by e-mail, in conversations with Seann Reed and Yuqiong.

I think the main issue is the mental overhead of engaging with the concept of explicit pools. That engagement probably isn't necessary until a user has an atypical use case, such as rolling windows or pooling multiple lead durations into a single pool, at which point they have already engaged with the idea of their own accord rather than being forced to by the software.

There are some downsides to this too: it replaces a simple, single default with a data-dependent one, and it requires a potentially expensive db query to obtain the unique lead times from the ingested data.
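
For a sense of what that query involves, here is a rough sketch against an in-memory SQLite database. The table @forecast_value@ and its columns are hypothetical and do not reflect the actual WRES schema; the point is only that the default would depend on a scan for distinct lead times across the ingested data.

```python
import sqlite3


def unique_lead_hours(connection):
    """Return the distinct lead times (in hours), ascending.

    Assumes a hypothetical table forecast_value(series_id, lead_hours, value).
    """
    rows = connection.execute(
        "SELECT DISTINCT lead_hours FROM forecast_value ORDER BY lead_hours"
    ).fetchall()
    return [row[0] for row in rows]


# Build a tiny illustrative dataset: two forecasts, each with two lead times.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE forecast_value (series_id INTEGER, lead_hours INTEGER, value REAL)"
)
connection.executemany(
    "INSERT INTO forecast_value VALUES (?, ?, ?)",
    [(1, 6, 0.4), (1, 12, 0.7), (2, 6, 0.5), (2, 12, 0.9)],
)
lead_hours = unique_lead_hours(connection)
```

On a large ingest, a @DISTINCT@ over every forecast value could be costly without a suitable index, which is the expense referred to above.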

Still, on balance, it might lead to less confusion and simpler declarations for the majority use case involving forecast datasets.

epag commented 1 month ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-09-22T11:31:59Z


For situations where the @time_scale@ is defined, this would need to be factored in too, both because it overrides whatever the data says about available lead times and because the majority use case would involve non-overlapping pools ending every @time_scale@ @period@. In that case, absent explicit @lead_times@, the data analysis/db query would only need to discover the smallest and largest lead times.
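
A rough Python sketch of that variant follows. It is illustrative only: the function name @time_scale_pools@ is hypothetical, and it assumes that the first pool starts at the smallest lead time and that a trailing partial period is dropped. The real alignment rule would need to be decided.

```python
from datetime import timedelta


def time_scale_pools(min_lead, max_lead, period):
    """Return non-overlapping (lower, upper) pools, each one period wide.

    Assumptions (not taken from WRES): pools start at min_lead, each pool
    ends exactly one period after it starts, and a final pool that would
    extend beyond max_lead is omitted.
    """
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    pools = []
    lower = min_lead
    while lower + period <= max_lead:
        pools.append((lower, lower + period))
        lower += period
    return pools


# Smallest and largest lead times discovered from the data, with a 24-hour period.
pools = time_scale_pools(
    timedelta(hours=0), timedelta(hours=72), timedelta(hours=24)
)
```

Only the two extreme lead times are needed as inputs here, which is why the query could be cheaper than enumerating every distinct lead time.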

Again, this does add some complexity, both in terms of default behavior and implementation, but it's worth considering further.