As a user, I want the WRDS-based thresholds output in the WRES output to be fully qualified so that I can concretely identify where they came from - Githubissues

epag commented 1 month ago

Author Name: Hank (Hank) Original Redmine Issue: 97209, https://vlab.noaa.gov/redmine/issues/97209 Original Date: 2021-10-06

See #97142. This can happen if a user evaluates NWM against USGS using NWS thresholds. For some NWM feature ids, thresholds from multiple NWS locations can be returned. The code needs to handle that; that is what #97142 is for. However, when seeing the thresholds in the output, I would like them qualified sufficiently to identify the WRDS location for which the thresholds were returned. It should also show the rating provider (e.g., USGS or NRLB) and rating source. There may be other important metadata to share.

This ticket applies to CSV2 and protobuf, I assume, but I'm not sure about the others. Images should not include that metadata as there is no good way to present it in a graphic, I think.

Thanks,

Hank

Related issue(s): #288 Redmine related issue(s): 98151, 99233, 99428

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-11-24T16:45:56Z

Just quickly tested a code change to see what impact it would have.

If I construct the @ThresholdOuter@ instances for WRDS thresholds from rating curves so that they include text after the name indicating the source of the rating curve from which the threshold is derived, that is enough to ensure that, when the rating curve source is not declared by the user, all thresholds from all rating curves are evaluated (see #98837). For example:

01464000_2590137_TACN4_NWM_Medium_Range_Single_Valued_SAMPLE_SIZE.csv:FEATURE DESCRIPTION,EARLIEST ISSUE TIME,LATEST ISSUE TIME,EARLIEST VALID TIME,LATEST VALID TIME,EARLIEST LEAD TIME IN SECONDS [INSTANTANEOUS],LATEST LEAD TIME IN SECONDS [INSTANTANEOUS],SAMPLE SIZE All data,SAMPLE SIZE >= 1362.5 ft3/s (action from rating curve USGS Rating Depot),SAMPLE SIZE >= 1362.5 ft3/s (bankfull from rating curve USGS Rating Depot),SAMPLE SIZE >= 1363.0 ft3/s (action from rating curve NRLDB),SAMPLE SIZE >= 1363.0 ft3/s (bankfull from rating curve NRLDB),SAMPLE SIZE >= 2034.0 ft3/s (flood from rating curve NRLDB),SAMPLE SIZE >= 2034.0 ft3/s (minor from rating curve NRLDB),SAMPLE SIZE >= 2034.2 ft3/s (flood from rating curve USGS Rating Depot),SAMPLE SIZE >= 2034.2 ft3/s (minor from rating curve USGS Rating Depot),SAMPLE SIZE >= 2537.0 ft3/s (moderate from rating curve NRLDB),SAMPLE SIZE >= 2537.2 ft3/s (moderate from rating curve USGS Rating Depot),SAMPLE SIZE >= 3368.0 ft3/s (major from rating curve NRLDB),SAMPLE SIZE >= 3368.3 ft3/s (major from rating curve USGS Rating Depot)

If I further change the the key in a map that stores the thresholds to use the full label, I can obtain outputs for the raw thresholds as well as the rating curve thresholds when a user does not clarify what they want. For example:

[Hank@nwcal-wres-ti01 output_for_NJ_with_threshold_name_update_individual_features_raw_too]$ grep FEATURE 01464000_2590137_TACN4_NWM_Medium_Range_Single_Valued_SAMPLE_SIZE.csv 
FEATURE DESCRIPTION,EARLIEST ISSUE TIME,LATEST ISSUE TIME,EARLIEST VALID TIME,LATEST VALID TIME,EARLIEST LEAD TIME IN SECONDS [INSTANTANEOUS],LATEST LEAD TIME IN SECONDS [INSTANTANEOUS],SAMPLE SIZE All data,SAMPLE SIZE >= 720.0 ft3/s (action),SAMPLE SIZE >= 1040.0 ft3/s (flood),SAMPLE SIZE >= 1362.5 ft3/s (action from rating curve USGS Rating Depot),SAMPLE SIZE >= 1362.5 ft3/s (bankfull from rating curve USGS Rating Depot),SAMPLE SIZE >= 1363.0 ft3/s (action from rating curve NRLDB),SAMPLE SIZE >= 1363.0 ft3/s (bankfull from rating curve NRLDB),SAMPLE SIZE >= 2034.0 ft3/s (flood from rating curve NRLDB),SAMPLE SIZE >= 2034.0 ft3/s (minor from rating curve NRLDB),SAMPLE SIZE >= 2034.2 ft3/s (flood from rating curve USGS Rating Depot),SAMPLE SIZE >= 2034.2 ft3/s (minor from rating curve USGS Rating Depot),SAMPLE SIZE >= 2537.0 ft3/s (moderate from rating curve NRLDB),SAMPLE SIZE >= 2537.2 ft3/s (moderate from rating curve USGS Rating Depot),SAMPLE SIZE >= 3368.0 ft3/s (major from rating curve NRLDB),SAMPLE SIZE >= 3368.3 ft3/s (major from rating curve USGS Rating Depot)

I can provide more information about the test I performed, if desired. The code change was to the method @GeneralThresholdDefinition.getThresholds@; the modified code is below, shared for posterity because I plan to undo it. However, my only goal was to confirm that insufficiently identifying thresholds was the cause of the problem noted in #98837, and, indeed, it is. I also confirmed a run where the features were pooled appropriately identified the thresholds by name.

Thanks,

Hank

===================================================================

        //Then we overwrite the original with calculated.
        if ( calculatedThresholds != null )
        {
            for ( Entry<String, Double> threshold : calculatedThresholds.entrySet() )
            {
                if ( threshold.getValue() != null )
                {
                    double value = calculatedUnitConversionOperator.applyAsDouble( threshold.getValue() );
                    thresholdMap.put( threshold.getKey() + " from rating curve " + getCalc_flow_values().getRating_curve().getSource(),
                                      ThresholdOuter.of(
                                                         OneOrTwoDoubles.of( value ),
                                                         thresholdOperator,
                                                         dataType,
                                                         threshold.getKey() + " from rating curve " + getCalc_flow_values().getRating_curve().getSource(),
                                                         MeasurementUnit.of( desiredUnitMapper.getDesiredMeasurementUnitName() ) ) );
                }
            }
        }
</code>

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-11-30T18:39:33Z

James, Jesse:

Is there any value in pursuing what I described in #97209-7 as an interim solution to evaluating all WRDS thresholds implied by a declaration and relaying that identifier to the output?

Recall that, if a user does not specify a rating curve provider, then currently the thresholds evaluated are the last thresholds found for a given name (e.g., "flood", "major", "action", etc.) while processing a map of threshold name -> threshold value. It could be a mix of NRLDB and USGS Rating Depot calculated thresholds. With the inelegant fix above, the name of the rating curve source is included in the name of the threshold.

The real fix is an update to the data model, but that will require significantly more time to complete (at least for me, since its not a part of the code I'm familiar with).

Thanks,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-11-30T19:15:40Z

I am not against it, largely because it is "better" than what exists. It's a bit yucky to make a half- fix, but it is better than broken, so I would go for it, on reflection. At the same time, you should probably anticipate the half- solution to exist semi-permanently ;)

epag commented 1 month ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-11-30T19:19:31Z

In short, I don't know.

A threshold of @>= 50 m3/s@ is identical to a threshold of @>= 50 m3/s@, regardless of name, and so it makes sense to merge them into one threshold for the purposes of data processing. On the other hand, the names matter because we might want to group by "NWS Major Flood" or plot that way, so we need to track the names and keep them separate. Maybe there is a way to track the threshold itself as a key and then a list of names as the value. That does sound like a more involved change. I don't have an objection to the qualifier in the name like you have above, except that it can be shorter like "NRLDB major" or "USGS Rating Depot bankfull."

epag commented 1 month ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-11-30T19:39:49Z

The permanent fix is pretty clear, I think, namely to expand the @ThresholdOuter@ (and hence the canonical @Threshold@ it wraps) to allow a ratings qualifier. Once that is done, we can make any downstream choices we want w/r to de-duplication. Currently, thresholds are de-duplicated on the basis of climatological probabilities only (edit: that is to say, different probabilities that map to the same real-valued threshold), so thresholds with distinct names will not be de-duplicated, even if the threshold values are the same.

Implementing the proper fix is probably something like "32". Implementing the half-***** fix is probably "0.32" or something. Note that the main reason for "32" is that it will create some additional slots in some of the statistics output formats. It shouldn't break any system tests because there are none with different ratings.

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-11-30T19:45:13Z

Thanks, James and Jesse.

Yeah, 0.32 hours is pretty close... I'm guessing it would take me about 1-2 hours to complete the change. The code change is already in place; I just need to run some system tests to see if anything breaks, because it may also impact WRDS JSON threshold tests which I think we include in one of the 800s, iirc.

We'll also need to agree upon the label (Jesse's shorter proposal looks fine to me).

I'll try to get to it later this week; for now, I need to prioritize the proxy issues, deploying 5.16, and completing at least some initial tests of Chris's output service stuff.

Thanks,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-11-30T19:49:16Z

But you won't be adding the qualifier unless there are duplicates, I assume(?) And I further assume that scenario802 does not contain thresholds with several different sources. Could be wrong on either of these, but I wouldn't add the qualification unless it is needed to de-duplicate. That way, we retain existing naming except where the extra qualification is needed (because the naming isn't really the correct spot for this qualification).

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-11-30T19:54:30Z

I can add an if-check to see if "major" already exists and, if so, add the rating curve identifier. No problem. Of course, I expect that to often be the case for the most "important" forecast points where an RFC has defined various raw thresholds already.

I'll check the 800 series thresholds to see what is used. I never looked closely.

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-11-30T19:57:05Z

scenario802 has two sets of thresholds from different threshold sources (as opposed to rating curves). It should trigger the same problem if the declaration does not clarify which source is desired. However, scenario802 does set the @provider@, so it won't be an issue.

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-11-30T20:05:39Z

Right, so there are two extra qualifiers and possibly nothing to stop N>2 qualifiers in future, so then you get into a situation where you may just want a -bag- map of qualifiers, i.e., key-value pairs and a corresponding builder method to add a new key-value qualifier pair. Even the proper fix is a bit yucky due to the complexity of thresholds.

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-11-30T20:14:30Z

Exactly.

By the way, an alternative is to force the user to be very specific about the thresholds they want from WRDS in such a way that duplication is not possible. We can require the source and rating source be specified, for example, if the format is WRDS. Then we map to only a single set of thresholds and can rely on WRDS labeling to avoid duplication of names. Or we can throw an exception when a duplicate is encountered and point to the attributes in the exception message ("Did you configure the provider and ratingProvider?").

I'm actually not opposed to implementing either of those, instead, since it makes no sense to me to mix thresholds from NRLDB, USGS rating depot, etc. I'm just not sure that everyone will think the way I do. Perhaps someone out there has a good reason to evaluate thresholds from both sets of curves.

If either of you prefer one of those alternative solutions, let me know.

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: James (James) Original Date: 2021-11-30T20:25:10Z

I think that would be pretty harsh, assuming a user wanted to evaluate the commonly-named thresholds from several different key-value pairs. I think your proposed hack is the right hack.

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-11-30T20:32:53Z

Sounds good. I'll try to get to it later in the week, perhaps even as soon as tomorrow.

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-07-21T12:32:55Z

James,

Did anything change related to this ticket as a byproduct of changes to thresholds recently put in place as a byproduct of moving to YAML?

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-07-21T12:36:12Z

The most foolproof way of knowing is to run an evaluation and witness the behavior :-)

I forget what we did for these one:many feature relationships w/r to thresholds.

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-07-21T13:35:05Z

I note the Description references the fact that WRDS can return RFC thresholds for multiple NWS locations when asked for based on NWM feature. This issue was raised as part of the UAT in #116116. Let me find the conversation there or in a related ticket,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-07-21T17:32:33Z

The output when using WRDS thresholds only include the threshold value and WRDS label, such as, "SAMPLE SIZE >= 0.0 ft3/s (low)". It could also include a reference to the rating curve identifier, such as, "MEAN ERROR >= 64.0 ft3/s (NRLDB low)".

However, there is no indication of the source being WRDS, though it may be understood based on the label shown. There is also no indication of the feature for which the thresholds were found. So, for example, if NWS thresholds are obtained using a NWM feature id, then thresholds could be returned for multiple NWS location ids. Hence, it would make sense to discern between them in the WRES outputs.

As discussed in more recent tickets, the idea of requesting NWS thresholds by an NWM feature id is flawed from the start, so perhaps we should not worry about the location id being included. I guess the question is, do we need some way to let users know that thresholds used in an evaluation were obtained from WRDS as opposed to, say, a CSV file or in-line declaration? But that would mean both WRDS and custom thresholds are evaluated in a single run. That sounds like a highly unlikely scenario.

At this point, I'm not sure that the thresholds are qualified enough in our outputs, but the only use cases where I think this could lead to confusion are not very likely, so I'm lowering the priority. I reserve the right to reject this ticket if we decide the benefit doesn't justify the added code.

Hank

NOAA-OWP / wres

As a user, I want the WRDS-based thresholds output in the WRES output to be fully qualified so that I can concretely identify where they came from #281