NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service
Other
2 stars 1 forks source link

As a user, I want duplicate time-series from NWIS to be treated in the same way as duplicates from other sources and services #108

Open epag opened 3 weeks ago

epag commented 3 weeks ago

Author Name: James (James) Original Redmine Issue: 113397, https://vlab.noaa.gov/redmine/issues/113397 Original Date: 2023-03-03


Given an evaluation that selects "duplicate" time-series from NWIS for a given feature When that evaluation proceeds Then the duplicates should be treated in the same way they're treated when reading from others sources and services So that the software is predictable


Redmine related issue(s): 116680


epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-03T19:38:14Z


The general rule is as follows.

Consider all time-series supplied by a source or service unless the user has supplied a filter to de-duplicate time-series.

The work in this ticket is to implement the general rule, but also to warn when duplicates are discovered as a convenience to users and to provide users with the options they need to de-duplicate time-series by filtering, if desired.

For example, when considering duplicates from NWIS, it should be possible to filter on the basis of whatever distinguishes the duplicates using an appropriate api parameter (e.g., transmitted via a WRES URL parameter), else to provide a filter that can be applied downstream of reading.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-03T19:40:31Z


There's an example dataset in #104572-25. To create an example evaluation, use this on both the left and right sides of the evaluation. Absent filters, that evaluation should produce as many paired time-series as time-series read from the service. With filters, it should produce a subset, depending on the time-series to be de-duplicated.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-03T19:42:05Z


Current behavior is to ignore (edit: albeit with a logged warning) any features for which duplicate time-series are discovered, which is highly undesirable and confusing to users.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T12:22:49Z


James wrote:

The work in this ticket is to implement the general rule, but also to warn when duplicates are discovered as a convenience to users and to provide users with the options they need to de-duplicate time-series by filtering, if desired.

The first two parts of this should be straightforward. The third part may not be straightforward, depending on the source of duplication and whether nwis offers url parameter(s) to de-duplicate. If not, that will require extra work but, on reflection, that should probably be a separate ticket, the addition of new filters on time-series metadata.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T13:03:51Z


First thing is to get a better log message so that the source of duplication is clearer.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T13:18:21Z


But, before that, need to reproduce with a simple example.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T13:31:37Z


Reproduced with this:

<?xml version='1.0' encoding='UTF-8'?>
<project name="issue113397">
  <inputs>
    <left>
      <type>observations</type>
      <source interface="usgs_nwis">https://nwis.waterservices.usgs.gov/nwis/iv</source>
      <variable>00060</variable>
    </left>
    <right>
      <type>observations</type>
      <source interface="usgs_nwis">https://nwis.waterservices.usgs.gov/nwis/iv</source>
      <variable>00060</variable>
    </right>
  </inputs>
  <pair>
    <unit>ft3/s</unit>
    <feature left="16103000" right="16103000" />
    <dates earliest="2022-01-01T00:00:00Z" latest="2023-01-01T00:00:00Z"/>
  </pair>
  <metrics>
    <metric><name>sample size</name></metric>
  </metrics>
  <outputs durationFormat="hours">
    <destination type="pairs"/>
    <destination type="csv2"/>
  </outputs>
</project>
</code>

2023-03-06T13:31:04.785+0000 WARN WatermlReader Skipping site 16103000 because multiple timeseries were discovered for variable 00060 from URI https://nwis.waterservices.usgs.gov/nwis/iv?endDT=2023-01-01T00%3A00%3A00Z&format=json&parameterCd=00060&sites=16103000&startDT=2022-01-01T00%3A00%3A01Z.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T14:04:08Z


Better message, although the time-series still need to be admitted (one-liner):

2023-03-06T14:00:45.916+0000 WARN WatermlReader Discovered 2 duplicate time-series for site 16103000 and variable 00060 from URI https://nwis.waterservices.usgs.gov/nwis/iv?endDT=2023-01-01T00%3A00%3A00Z&format=json&parameterCd=00060&sites=16103000&startDT=2022-01-01T00%3A00%3A01Z. The method qualifiers were: [Method[methodID=321515,methodDescription=[~100 feet downstream]], Method[methodID=42104,methodDescription=[Upstream]]]. The general qualifiers were: [Qualifier[qualifierID=0,qualifierCode=P,qualifierDescription=Provisional data subject to revision.,network=NWIS,vocabulary=uv_rmk_cd], Qualifier[qualifierID=0,qualifierCode=e,qualifierDescription=Value has been estimated.,network=NWIS,vocabulary=uv_rmk_cd], Qualifier[qualifierID=1,qualifierCode=P,qualifierDescription=Provisional data subject to revision.,network=NWIS,vocabulary=uv_rmk_cd]]. The censor codes were: []. The sources were []. The Quality control levels were: []. All of these time-series were admitted. If this behavior is undesired, filtering may be possible.

This clarifies that the primary distinction is on the method:

The method qualifiers were: [Method[methodID=321515,methodDescription=[~100 feet downstream]], Method[methodID=42104,methodDescription=[Upstream]]]

In other words, different gauge locations, something like that.

As to whether NWIS exposes the @methodID@ as a filter, that is TBD but, if not, the preferred solution would be to have that filter added and the less preferred solution would be to add general purpose key-value pair filters to the application itself, but that is another ticket.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T14:50:23Z


Tweaked the message:

2023-03-06T14:47:52.611+0000 WARN WatermlReader Discovered 2 duplicate time-series for site 16103000 and variable 00060 from URI https://nwis.waterservices.usgs.gov/nwis/iv?endDT=2023-01-01T00%3A00%3A00Z&format=json&parameterCd=00060&sites=16103000&startDT=2022-01-01T00%3A00%3A01Z. Duplication means that several time-series were encountered that had overlapping valid times and different metadata. Encountered the following metadata across the duplicate time-series: the method qualifiers were [Method[methodID=321515,methodDescription=[~100 feet downstream]], Method[methodID=42104,methodDescription=[Upstream]]]; the general qualifiers were [Qualifier[qualifierID=0,qualifierCode=P,qualifierDescription=Provisional data subject to revision.,network=NWIS,vocabulary=uv_rmk_cd], Qualifier[qualifierID=0,qualifierCode=e,qualifierDescription=Value has been estimated.,network=NWIS,vocabulary=uv_rmk_cd], Qualifier[qualifierID=1,qualifierCode=P,qualifierDescription=Provisional data subject to revision.,network=NWIS,vocabulary=uv_rmk_cd]]; the censor codes were []; the sources were []; and the quality control levels were []. All of the duplicate time-series were admitted. If this behavior is undesired, filtering may be possible.

Also admitted the time-series through, so the evaluation now passes.

2023-03-06T14:47:56.811+0000 INFO Evaluation Closed evaluation P3NRvwq-gM3ywFOa8A3yT6_6Gho with status EVALUATION_COMPLETE_REPORTED_SUCCESS. This evaluation contained 1 evaluation description message, 1 statistics messages, 0 pairs messages and 5 evaluation status messages. The exit code was 0.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T14:51:10Z


Next question: if a user wanted to de-duplicate against method ID, could they do that using the NWIS API?

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T14:54:29Z


Seemingly not. For example, adding an arbitrary method id:

https://nwis.waterservices.usgs.gov/nwis/iv?endDT=2023-01-01T00%3A00%3A00Z&format=json&parameterCd=00060&sites=16103000&startDT=2022-01-01T00%3A00%3A01Z&methodID=1

That returns the same time-series as before.

Likewise, if I include a legitimate method ID:

https://nwis.waterservices.usgs.gov/nwis/iv?endDT=2023-01-01T00%3A00%3A00Z&format=json&parameterCd=00060&sites=16103000&startDT=2022-01-01T00%3A00%3A01Z&methodID=42104

No dice.

But if I include an arbitrary url parameter, @foo=bar@, I do get a 400 error code (@unrecognized keyword: foo [position=0], server=[nadww01]@):

https://nwis.waterservices.usgs.gov/nwis/iv?endDT=2023-01-01T00%3A00%3A00Z&format=json&parameterCd=00060&sites=16103000&startDT=2022-01-01T00%3A00%3A01Z&foo=bar

So perhaps the methodID is part of the API, but not respected, so perhaps that is a bug. I will try to report it.

I would prefer not to add general purpose time-series filters until we actually need them. I think this de-duplication problem should be solvable through a url parameter on the nwis request.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T15:00:14Z


Looks like their feedback form is not found:

https://answers.usgs.gov/cgi-bin/gsanswers?tmplt=1&pemail=gs-w_waterdata_support

I will need to dig out an old contact and see how best to report bugs.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T15:09:28Z


Contact found, e-mail sent. I will update the ticket if/when I get a reply and then create a new WRES ticket, as needed.

In the mean time, I will run the system tests and then push the rudimentary solution to this ticket (treat NWIS like any other source/service, but without filters to aid in de-duplication, yet).

It's perfectly possible that some of the 7xx or other system tests that rely on NWIS will be broken on additional benchmark datasets, TBD.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T15:28:39Z


No system test failures, so pushing the fix.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T15:29:27Z


James wrote:

No system test failures, so pushing the fix.

commit:wres|ff9d924eb11d9034c8f6aa84cb39b5bae0204d74.

On hold until I hear something from USGS about reporting a potential bug in the @methodID@ filter.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T15:57:32Z


Received an initial response from my contact there who cc:d my e-mail to @gs-w-iow_po_team@usgs.gov@. Awaiting further input.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T16:13:37Z


Currently engaged in a conversation w/ USGS out-of-band. They have responded very promptly, which is appreciated.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-04-18T20:34:05Z


Need to follow up w/ USGS on this one, 6.13.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-05-19T12:10:05Z


( Just sent a follow-up e-mail on this one. )

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T10:44:16Z


Sent another follow-up. No joy, after their initial responsiveness. Sigh.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T10:46:16Z


Demoting to the backlog because this isn't being worked.

Also noting that we could filter by @methodID@ on our end, which is #70498, but it would be cleaner to pose a more refined request, if the API allowed for it, and it seems to present as doing so, but with a bug, hence the exchange w/ USGS (TBD).

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T10:48:01Z


James wrote:

Sent another follow-up. No joy, after their initial responsiveness. Sigh.

( Ah, just got an undeliverable notification, so perhaps that person has moved on. But I addressed this one to multiple people, so would still hope for a reply. )