Open epag opened 2 months ago
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-03T19:38:14Z
The general rule is as follows.
Consider all time-series supplied by a source or service unless the user has supplied a filter to de-duplicate time-series.
The work in this ticket is to implement the general rule, but also to warn when duplicates are discovered as a convenience to users and to provide users with the options they need to de-duplicate time-series by filtering, if desired.
For example, when considering duplicates from NWIS, it should be possible to filter on whatever distinguishes the duplicates, either via an appropriate API parameter (e.g., transmitted through a WRES URL parameter) or via a filter applied downstream of reading.
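The general rule above (admit everything, warn on duplicates, filter only on request) can be sketched as follows. This is a minimal illustration in Python, not the WRES implementation; the field names (@feature@, @variable@, @method_id@) are assumptions for the sake of the example.

```python
from collections import defaultdict

def admit_with_warning(series_list, metadata_filter=None):
    """Admit all time-series; warn when several share a feature/variable key.

    metadata_filter, if supplied, is a user predicate used to de-duplicate
    before admission (the opt-in filtering described in the ticket)."""
    if metadata_filter is not None:
        series_list = [s for s in series_list if metadata_filter(s)]
    by_key = defaultdict(list)
    for s in series_list:
        by_key[(s["feature"], s["variable"])].append(s)
    for key, group in by_key.items():
        if len(group) > 1:
            print(f"WARN: {len(group)} duplicate time-series for {key}; "
                  "all were admitted. Filtering may be possible.")
    return series_list

series = [
    {"feature": "16103000", "variable": "00060", "method_id": "321515"},
    {"feature": "16103000", "variable": "00060", "method_id": "42104"},
]
# Without a filter, both duplicates are admitted (with a warning):
assert len(admit_with_warning(series)) == 2
# With a filter on method_id, only one remains:
assert len(admit_with_warning(series, lambda s: s["method_id"] == "42104")) == 1
```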
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-03T19:40:31Z
There's an example dataset in #104572-25. To create an example evaluation, use this on both the left and right sides of the evaluation. Absent filters, that evaluation should produce as many paired time-series as time-series read from the service. With filters, it should produce a subset, depending on the time-series to be de-duplicated.
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-03T19:42:05Z
Current behavior is to ignore (edit: albeit with a logged warning) any features for which duplicate time-series are discovered, which is highly undesirable and confusing to users.
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T12:22:49Z
James wrote:
The work in this ticket is to implement the general rule, but also to warn when duplicates are discovered as a convenience to users and to provide users with the options they need to de-duplicate time-series by filtering, if desired.
The first two parts of this should be straightforward. The third part may not be, depending on the source of duplication and whether NWIS offers URL parameter(s) to de-duplicate. If not, that will require extra work, namely the addition of new filters on time-series metadata but, on reflection, that should probably be a separate ticket.
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T13:03:51Z
First thing is to get a better log message so that the source of duplication is clearer.
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T13:18:21Z
But, before that, need to reproduce with a simple example.
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T13:31:37Z
Reproduced with this:
<?xml version='1.0' encoding='UTF-8'?>
<project name="issue113397">
<inputs>
<left>
<type>observations</type>
<source interface="usgs_nwis">https://nwis.waterservices.usgs.gov/nwis/iv</source>
<variable>00060</variable>
</left>
<right>
<type>observations</type>
<source interface="usgs_nwis">https://nwis.waterservices.usgs.gov/nwis/iv</source>
<variable>00060</variable>
</right>
</inputs>
<pair>
<unit>ft3/s</unit>
<feature left="16103000" right="16103000" />
<dates earliest="2022-01-01T00:00:00Z" latest="2023-01-01T00:00:00Z"/>
</pair>
<metrics>
<metric><name>sample size</name></metric>
</metrics>
<outputs durationFormat="hours">
<destination type="pairs"/>
<destination type="csv2"/>
</outputs>
</project>
2023-03-06T13:31:04.785+0000 WARN WatermlReader Skipping site 16103000 because multiple timeseries were discovered for variable 00060 from URI https://nwis.waterservices.usgs.gov/nwis/iv?endDT=2023-01-01T00%3A00%3A00Z&format=json&parameterCd=00060&sites=16103000&startDT=2022-01-01T00%3A00%3A01Z.
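For illustration, here is a minimal sketch of how the duplication can be detected in the service response. The JSON fragment is a trimmed, hypothetical example of the shape the NWIS IV service returns (two @timeSeries@ entries for the same site and variable, distinguished only by the method block); it is not the actual response for this site.

```python
import json
from collections import Counter

# Trimmed, hypothetical WaterML JSON fragment with two timeSeries entries
# for the same site and variable, differing only in the method block.
waterml = json.loads("""
{
  "value": {
    "timeSeries": [
      {"sourceInfo": {"siteCode": [{"value": "16103000"}]},
       "variable": {"variableCode": [{"value": "00060"}]},
       "values": [{"method": [{"methodID": 321515,
                               "methodDescription": "~100 feet downstream"}]}]},
      {"sourceInfo": {"siteCode": [{"value": "16103000"}]},
       "variable": {"variableCode": [{"value": "00060"}]},
       "values": [{"method": [{"methodID": 42104,
                               "methodDescription": "Upstream"}]}]}
    ]
  }
}
""")

# Count time-series per (site, variable) key; more than one means duplicates.
counts = Counter(
    (ts["sourceInfo"]["siteCode"][0]["value"],
     ts["variable"]["variableCode"][0]["value"])
    for ts in waterml["value"]["timeSeries"])

for (site, variable), n in counts.items():
    if n > 1:
        print(f"Discovered {n} duplicate time-series for site {site} "
              f"and variable {variable}.")
```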
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T14:04:08Z
Better message, although the time-series still need to be admitted (one-liner):
2023-03-06T14:00:45.916+0000 WARN WatermlReader Discovered 2 duplicate time-series for site 16103000 and variable 00060 from URI https://nwis.waterservices.usgs.gov/nwis/iv?endDT=2023-01-01T00%3A00%3A00Z&format=json&parameterCd=00060&sites=16103000&startDT=2022-01-01T00%3A00%3A01Z. The method qualifiers were: [Method[methodID=321515,methodDescription=[~100 feet downstream]], Method[methodID=42104,methodDescription=[Upstream]]]. The general qualifiers were: [Qualifier[qualifierID=0,qualifierCode=P,qualifierDescription=Provisional data subject to revision.,network=NWIS,vocabulary=uv_rmk_cd], Qualifier[qualifierID=0,qualifierCode=e,qualifierDescription=Value has been estimated.,network=NWIS,vocabulary=uv_rmk_cd], Qualifier[qualifierID=1,qualifierCode=P,qualifierDescription=Provisional data subject to revision.,network=NWIS,vocabulary=uv_rmk_cd]]. The censor codes were: []. The sources were []. The Quality control levels were: []. All of these time-series were admitted. If this behavior is undesired, filtering may be possible.
This clarifies that the primary distinction is on the method:
The method qualifiers were: [Method[methodID=321515,methodDescription=[~100 feet downstream]], Method[methodID=42104,methodDescription=[Upstream]]]
In other words, different gauge locations, something like that.
Whether NWIS exposes the @methodID@ as a filter is TBD. If not, the preferred solution would be to have that filter added to the service; the less preferred solution would be to add general-purpose key-value pair filters to the application itself, but that is another ticket.
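The "less preferred" client-side option could look something like the sketch below: a predicate over the @timeSeries@ entries that keeps only the series matching a requested @methodID@. This is an assumption-laden illustration of the idea, not the WRES implementation; the dict shapes mirror the WaterML JSON structure.

```python
def filter_by_method_id(series_list, method_id):
    """Keep only time-series whose method block matches method_id.

    series_list items are dicts shaped like NWIS timeSeries entries;
    this sketches client-side de-duplication by method, as a fallback
    when the service cannot filter server-side."""
    return [ts for ts in series_list
            if any(m.get("methodID") == method_id
                   for v in ts.get("values", [])
                   for m in v.get("method", []))]

series = [
    {"values": [{"method": [{"methodID": 321515}]}]},
    {"values": [{"method": [{"methodID": 42104}]}]},
]
# Filtering on one methodID removes the duplicate:
assert len(filter_by_method_id(series, 42104)) == 1
```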
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T14:50:23Z
Tweaked the message:
2023-03-06T14:47:52.611+0000 WARN WatermlReader Discovered 2 duplicate time-series for site 16103000 and variable 00060 from URI https://nwis.waterservices.usgs.gov/nwis/iv?endDT=2023-01-01T00%3A00%3A00Z&format=json&parameterCd=00060&sites=16103000&startDT=2022-01-01T00%3A00%3A01Z. Duplication means that several time-series were encountered that had overlapping valid times and different metadata. Encountered the following metadata across the duplicate time-series: the method qualifiers were [Method[methodID=321515,methodDescription=[~100 feet downstream]], Method[methodID=42104,methodDescription=[Upstream]]]; the general qualifiers were [Qualifier[qualifierID=0,qualifierCode=P,qualifierDescription=Provisional data subject to revision.,network=NWIS,vocabulary=uv_rmk_cd], Qualifier[qualifierID=0,qualifierCode=e,qualifierDescription=Value has been estimated.,network=NWIS,vocabulary=uv_rmk_cd], Qualifier[qualifierID=1,qualifierCode=P,qualifierDescription=Provisional data subject to revision.,network=NWIS,vocabulary=uv_rmk_cd]]; the censor codes were []; the sources were []; and the quality control levels were []. All of the duplicate time-series were admitted. If this behavior is undesired, filtering may be possible.
Also admitted the time-series, so the evaluation now passes.
2023-03-06T14:47:56.811+0000 INFO Evaluation Closed evaluation P3NRvwq-gM3ywFOa8A3yT6_6Gho with status EVALUATION_COMPLETE_REPORTED_SUCCESS. This evaluation contained 1 evaluation description message, 1 statistics messages, 0 pairs messages and 5 evaluation status messages. The exit code was 0.
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T14:51:10Z
Next question: if a user wanted to de-duplicate against method ID, could they do that using the NWIS API?
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T14:54:29Z
Seemingly not. For example, adding an arbitrary method id:
That returns the same time-series as before.
Likewise, if I include a legitimate method ID:
No dice.
But if I include an arbitrary url parameter, @foo=bar@, I do get a 400 error code (@unrecognized keyword: foo [position=0], server=[nadww01]@):
So perhaps @methodID@ is part of the API but not respected, which may be a bug. I will try to report it.
I would prefer not to add general-purpose time-series filters until we actually need them. I think this de-duplication problem should be solvable through a URL parameter on the NWIS request.
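For reference, the kind of request being experimented with can be built as below. Note the @methodId@ parameter here is hypothetical: per the tests described in this comment, the service accepts it without a 400 but does not appear to act on it.

```python
from urllib.parse import urlencode

# Sketch of the NWIS IV request used in this ticket, with the (seemingly
# ignored) methodId parameter appended. No network call is made here.
base = "https://nwis.waterservices.usgs.gov/nwis/iv"
params = {
    "format": "json",
    "sites": "16103000",
    "parameterCd": "00060",
    "startDT": "2022-01-01T00:00:00Z",
    "endDT": "2023-01-01T00:00:00Z",
    "methodId": "42104",  # hypothetical filter; not respected by the service
}
url = f"{base}?{urlencode(params)}"
print(url)
```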
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T15:00:14Z
Looks like their feedback form is not found:
https://answers.usgs.gov/cgi-bin/gsanswers?tmplt=1&pemail=gs-w_waterdata_support
I will need to dig out an old contact and see how best to report bugs.
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T15:09:28Z
Contact found, e-mail sent. I will update the ticket if/when I get a reply and then create a new WRES ticket, as needed.
In the meantime, I will run the system tests and then push the rudimentary solution to this ticket (treat NWIS like any other source/service, but without filters to aid in de-duplication, yet).
It's entirely possible that some of the 7xx or other system tests that rely on NWIS will break against additional benchmark datasets; TBD.
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T15:28:39Z
No system test failures, so pushing the fix.
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T15:29:27Z
James wrote:
No system test failures, so pushing the fix.
commit:wres|ff9d924eb11d9034c8f6aa84cb39b5bae0204d74.
On hold until I hear something from USGS about reporting a potential bug in the @methodID@ filter.
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T15:57:32Z
Received an initial response from my contact there, who cc'd my e-mail to @gs-w-iow_po_team@usgs.gov@. Awaiting further input.
Original Redmine Comment Author Name: James (James) Original Date: 2023-03-06T16:13:37Z
Currently engaged in a conversation w/ USGS out-of-band. They have responded very promptly, which is appreciated.
Original Redmine Comment Author Name: James (James) Original Date: 2023-04-18T20:34:05Z
Need to follow up w/ USGS on this one, 6.13.
Original Redmine Comment Author Name: James (James) Original Date: 2023-05-19T12:10:05Z
( Just sent a follow-up e-mail on this one. )
Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T10:44:16Z
Sent another follow-up. No joy, after their initial responsiveness. Sigh.
Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T10:46:16Z
Demoting to the backlog because this isn't being worked.
Also noting that we could filter by @methodID@ on our end, which is #70498, but it would be cleaner to pose a more refined request if the API allowed for it. The API seems to present as doing so, but with a bug, hence the exchange w/ USGS (TBD).
Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T10:48:01Z
James wrote:
Sent another follow-up. No joy, after their initial responsiveness. Sigh.
( Ah, just got an undeliverable notification, so perhaps that person has moved on. But I addressed this one to multiple people, so would still hope for a reply. )
Author Name: James (James) Original Redmine Issue: 113397, https://vlab.noaa.gov/redmine/issues/113397 Original Date: 2023-03-03
Given an evaluation that selects "duplicate" time-series from NWIS for a given feature
When that evaluation proceeds
Then the duplicates should be treated in the same way they're treated when reading from other sources and services
So that the software is predictable
Redmine related issue(s): 116680