NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a user, I want to mitigate against bad data for particular issued datetimes #78

Open epag opened 3 weeks ago

epag commented 3 weeks ago

Author Name: James (James)
Original Redmine Issue: 103868, https://vlab.noaa.gov/redmine/issues/103868
Original Date: 2022-04-19


Given an evaluation of the nwm
When time-series data are requested from a source (e.g., wrds, dstore) that contain several issued times with bad data
Then I want to be able to mitigate that in wres

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T12:27:35Z


From e-mail, redacted:

Hank,

Right, they won't work with WRDS. In that case, there may not be a good mitigation. It doesn't sound as though a value constraint would work in this case, because the values are not missing or invalid values of streamflow. We do not allow constraints on issue times or valid times that are more complex than a single interval. There is probably a ticket in that.

Regarding what WRDS serves, I think we prefer dumb data services on the whole, so anything that asks them to go in there and mess with data is probably not ideal. We should be able to mitigate on our end; in general, any client that works with data from the real world needs to present options to users that can mitigate against data issues, so I think this one is on us (as well as the publisher).

Cheers,

James
On 19/04/2022 13:18, Hank Herr - NOAA Federal wrote:
> James:
>
> Thanks for the feedback.  If you know of a reasonable mitigation, let me know.  Alex runs using WRDS and I don't believe glob patterns will work with WRDS, but I could be wrong.
>
> Fernando:
>
> So the message should read 2z through 16z.  Okay.  Would this impact all NWM data, including AnA, and SRF and MRF generated with basis/reference times between 2z and 16z?  
>
> Lastly, I think we should ask WRDS to not serve the data that is impacted (i.e., remove it from its database).  It's better to serve no data than bad data, in my opinion.  Thoughts?
>
> Hank
>
>
>
> On Mon, Apr 18, 2022 at 6:04 PM Fernando Salas - NOAA Federal <SNIP> wrote:
>
>     The data issue started with the 2z reference times and ran through 16z. 17z reference times and beyond would be good. APD can speak best to the why if you'd like to include more information. It's my understanding that the bad data won't be fixed.
>
>     On Mon, Apr 18, 2022 at 3:24 PM James Brown <SNIP> wrote:
>
>         Hank,
>
>         It would be nice to have a better description than "bad" or "corrupt", but I assume you don't have one yet. Regardless, I think it gets the message across. You may want to clarify that the times are issue times and are inclusive (I assume) and that streamflow from the entire domain is potentially incorrect (I assume). We probably want to contact Alex separately and suggest any mitigations that might be able to eliminate the time-series from these issue times (e.g., perhaps we can suggest a glob pattern).
>
>         Cheers,
>
>         James
>         On 18/04/2022 21:06, Hank Herr - NOAA Federal wrote:
>>         All:
>>
>>         Below is a draft of a News item I plan to post tomorrow morning as soon as I can.  I've cc'd Fernando since he let me know about the corrupt data so that he can correct the message or provide additional information.  As an example, Fernando said that there were instances of streamflow for the Mississippi being on the order of 100 CFS.
>>
>>         Please review.  Again, I need to draft a message first thing tomorrow, since it will impact the overnight evaluations tonight. 
>>
>>         Thanks,
>>
>>         Hank
>>
>>         =====================================
>>
>>         Bad NWM Data for April 18, 2022, 3Z to 17Z 
>>
>>         The NWM output data from April 18, 2022, 3Z to 17Z, is corrupt.  This includes data acquired from the NWC dStore archive and served from WRDS.  Evaluations spanning that time period can be expected to be incorrect due to that bad data.  It's unclear at this time if and when (and how) the sources of the NWM data WRES users employ, WRDS and NWC dStore, will be "fixed".  If you have any questions, please contact Hank Herr (Hank[removed]).     
epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T12:30:57Z


I suppose the most direct/obvious solution is to allow a constraint on @issuedDates@ (and probably on -@validDates@- @dates@ and @leadHours@ too) that is more complex than a single interval. In terms of declaration, perhaps the most obvious thing is to allow zero, one or more @issuedDates@ and zero, one or more @validDates@ and zero, one or more @leadHours@ (in some cases, zero is not allowed, depending on other declaration).
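
For illustration only (this is not valid declaration today, and the @earliest@/@latest@ attribute names are assumptions based on the existing @dateCondition@), a relaxed @issuedDates@ might skirt the bad period with two intervals whose union forms the filter:

            <!-- Hypothetical: two issued-time intervals whose union excludes     -->
            <!-- 2022-04-18 02Z-16Z; the outer bounds are illustrative.           -->
            <issuedDates earliest="2022-04-11T00:00:00Z" latest="2022-04-18T01:59:59Z" />
            <issuedDates earliest="2022-04-18T17:00:00Z" latest="2022-04-25T00:00:00Z" />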

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T12:32:36Z


But, obviously, allowing something in declaration and actually supporting that declaration are two completely different things.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T12:37:54Z


As an aside, we're a bit inconsistent in how we declare the bounds, sometimes explicitly, sometimes implicitly. The defaults for @minOccurs@ and @maxOccurs@ are 1, btw.

            <xs:element name="unit" type="xs:string" minOccurs="0" maxOccurs="1" />
            <xs:element name="unitAlias" type="unitAlias" minOccurs="0" maxOccurs="unbounded" />
            <xs:element name="featureService" type="featureService" minOccurs="0" maxOccurs="1" />
            <xs:element name="feature" type="feature" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="featureGroup" type="featurePool" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="gridSelection" type="unnamedFeature" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="leadHours" type="intBoundsType" minOccurs="0" />

            <!-- analysisDurations chooses which analysis durations to evaluate #65216, #61593 -->
            <xs:element name="analysisDurations" type="durationBoundsType" minOccurs="0" />
            <xs:element name="dates" type="dateCondition" minOccurs="0" />
            <xs:element name="issuedDates" type="dateCondition" minOccurs="0" />
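A minimal sketch of the relaxation (assuming repeated elements would be interpreted as the union of the declared intervals):

            <!-- Sketch only: relax maxOccurs so several intervals can be declared;  -->
            <!-- repeated elements would be read as the union of the intervals.      -->
            <xs:element name="leadHours" type="intBoundsType" minOccurs="0" maxOccurs="unbounded" />
            <xs:element name="dates" type="dateCondition" minOccurs="0" maxOccurs="unbounded" />
            <xs:element name="issuedDates" type="dateCondition" minOccurs="0" maxOccurs="unbounded" />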
epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T12:39:41Z


Not sure about the priority of this ticket (edit: or the target release). Perhaps it is much higher if there is no workaround, i.e., this period of invalid time-series data could continue to compromise nwm evaluations for weeks to come, I suppose.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T12:40:42Z


Added Alex as a watcher to this one.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-04-19T12:43:34Z


For AnA, the data is obtained based on its valid datetime, not its issued datetime, so the solution will need to be implemented for @dates@, not just "probably" (you mention @validDates@ as a "probably", though I think you were referring to @dates@). That is, unless I'm misunderstanding something about how the WRES obtains AnA data. The AnA data is served as a long time series from WRDS, though I think (?) WRDS is stitching together multiple sources to obtain that time series.

Agreed on the solutions in #103868-2.

Thanks for creating this ticket,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T12:44:01Z


The @season@ constraint isn't a workaround because we only allow one of those too. I think there was another ticket on relaxing that.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T12:45:48Z


Hank wrote:

so the solution will need to be implemented for @dates@, not just "probably" (you mention @validDates@ as a "probably", though I think you were referring to @dates@).

Yup, @dates@, I forgot that we don't clarify the flavor (it's implicit).

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T12:46:50Z


Yes, #51030 would probably be another solution.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T12:49:52Z


( Although a closer look at the use case for #51030 suggests that @validDatesPoolingWindow@ might be the better approach for that one (based on datetime intervals), aka #86646. )

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T12:51:29Z


-Actually, I think I like #86646 as the better solution, the most flexible and probably the easiest to implement and declare too, although our declaration language needs some work more generally.-

edit: Ah, but the goal here is to combine/filter the two periods either side of the bad data into one pool, not to separately pool the time-series either side of the bad data so, yes, a more flexible filter on the @dates@/@issuedDates@/@leadHours@ is probably the best approach.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-04-19T12:59:00Z


I'll see if I can work out a glob pattern that can help with users who use dStore. I'm not that good with glob, so it might take some time. I need a pattern that will exclude files for 4/18/2022 from 2Z to 16Z. Tricky.
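
For example (a sketch only - assuming the usual NWM blob naming, e.g. @nwm.t02z.short_range.channel_rt.f001.conus.nc@, and that glob character classes work in dStore source paths; the path prefix is hypothetical), patterns like these would admit only the 00-01Z and 17-23Z cycles for 2022-04-18:

            <!-- Hypothetical dStore sources that skip the 02Z-16Z cycles on 2022-04-18.   -->
            <!-- The character classes select hours 00-01, 17-19 and 20-23, respectively.  -->
            <source>file:///dstore/nwm.20220418/nwm.t0[01]z.short_range.channel_rt.f*.conus.nc</source>
            <source>file:///dstore/nwm.20220418/nwm.t1[7-9]z.short_range.channel_rt.f*.conus.nc</source>
            <source>file:///dstore/nwm.20220418/nwm.t2[0-3]z.short_range.channel_rt.f*.conus.nc</source>

The other dates in the evaluation period would still need their own, unrestricted patterns.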

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T13:04:30Z


It may be easier to list the individual blobs, I don't know whether that is possible.

epag commented 3 weeks ago

Original Redmine Comment Author Name: alexander.maestre (alexander.maestre) Original Date: 2022-04-19T13:24:20Z


James wrote:

Added Alex as a watcher to this one.

Thank you for adding me.

Checking the data...

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T13:56:20Z


James wrote:

I suppose the most direct/obvious solution is to allow a constraint on @issuedDates@ (and probably on -@validDates@- @dates@ and @leadHours@ too) that is more complex than a single interval. In terms of declaration, perhaps the most obvious thing is to allow zero, one or more @issuedDates@ and zero, one or more @validDates@ and zero, one or more @leadHours@ (in some cases, zero is not allowed, depending on other declaration).

I think the simplest thing is to not complicate the declaration any more than it already is. In this (hopefully) rare case, the workaround I favor is to curate the dataset by hand and run the evaluation using that dataset.

Hank wrote:

I'll see if I can work out a glob pattern that can help with users who use dStore. I'm not that good with glob, so it might take some time. I need a pattern that will exclude files for 4/18/2022 from 2Z to 16Z. Tricky.

The NWM reader is not file based anymore, so I don't think that would work. I think the workaround here is to curate the dataset by hand and run the evaluation against that.

A less laborious workaround is to ignore the results of any evaluation that included the bad data.

Another option (in between) is to use existing tools or update our tools that make it easy to convert NWM data into usable form (in other words, assist with the job of curating the dataset).
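
As a rough sketch of that curation workaround (element names follow the schema excerpt above; the path is hypothetical), the declaration would simply point at the curated copy instead of WRDS:

            <!-- Hypothetical: read from a hand-curated local copy of the NWM data from  -->
            <!-- which the bad 2022-04-18 02Z-16Z cycles have already been removed.      -->
            <source>file:///data/curated/nwm/</source>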

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T13:59:06Z


I don't think this is a rare case. It happens quite often. By total coincidence, I was just discussing exactly the same problem with the HEFS folks. So I think we need to provide appropriate tools. Messing with data is not the answer. Simple things should be simple. More complex things should be allowed to be more complex. Relaxing the @maxOccurs@ does not complicate the declaration of simple things.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T14:04:02Z


I think #86646 would be good too.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T14:05:53Z


If you have data corruption, I'm sorry, but I think that is a far bigger issue outside WRES.

If data corruption happens regularly, again, that's outside WRES.

But perhaps we can build an AI data corruption flagging tool, wouldn't that be nice?

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T14:07:08Z


Jesse wrote:

I think #86646 would be good too.

Yeah, I thought so at first, but it isn't actually the problem here, which is a filtering problem, not a pooling problem. Issue #86646 doesn't propose to add more complex intervals for pools, rather to allow an explicitly declared pool sequence. The goal in this use case is to omit some data from a pool, not to create N pools.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T14:07:59Z


Jesse wrote:

If you have data corruption, I'm sorry, but I think that is a far bigger issue outside WRES.

If data corruption happens regularly, again, that's outside WRES.

But perhaps we can build an AI data corruption flagging tool, wouldn't that be nice?

Sorry but real data = bad data sometimes, it just does.

By all means, ask the publisher to correct, as I suggested we should do.

We also need to provide tools.

Our answer cannot be "sorry, you cannot use wres".

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T14:08:31Z


If the data are corrupt, how can we even count on the issued dates to be correct? What is this so-called corruption that is so easily worked around by a machine?

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T14:09:47Z


James wrote:

Jesse wrote:

I think #86646 would be good too.

Yeah, I thought so at first, but it isn't actually the problem here, which is a filtering problem, not a pooling problem. Issue #86646 doesn't propose to add more complex intervals for pools, rather to allow an explicitly declared pool sequence. The goal in this use case is to omit some data from a pool, not to create N pools.

You say tomato I say tomato. To say "I want N pools" is necessarily to say "I do not want any pools except the N pools" meaning to exclude or omit. So it does achieve the goal.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T14:09:59Z


There is no need for AI. The simple use case here is a constraint on @issuedDates@. Incidentally, this is exactly the same situation that HEFS described today too. This is a common situation w/r/t bad data; it usually involves a specific set of datetimes.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T14:10:18Z


Also if the data are corrupt, how are they readable at all?

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T14:11:09Z


Jesse wrote:

If the data are corrupt, how can we even count on the issued dates to be correct? What is this so-called corruption that is so easily worked around by a machine?

This is the exact scenario presented. The issue datetimes are fine, the data is not. You can dislike the scenario, but you don't get to redefine it.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T14:13:28Z


Perhaps we have different definitions of "corrupt", I need to go see what these data look like before commenting further. Usually when data are corrupt, you can't even read them successfully.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T14:13:56Z


Jesse wrote:

You say tomato I say tomato. To say "I want N pools" is necessarily to say "I do not want any pools except the N pools" meaning to exclude or omit. So it does achieve the goal.

It doesn't achieve the goal. The use case is a set of time-series, @{A,B,C}@, where @{A,C}@ should be placed in a pool. You cannot do that with explicit pools because we only allow one interval per time dimension in our pool descriptions, not N. The use case is not @{A}@ and @{C}@ in two separate pools.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T14:15:04Z


Jesse wrote:

Perhaps we have different definitions of "corrupt", I need to go see what these data look like before commenting further. Usually when data are corrupt, you can't even read them successfully.

As I said in the e-mail exchange, "corrupt" is not likely to be the correct term here. Regardless, readable data with wrong time-series values = reality with time-series data, it happens quite often (e.g., instrument saturation etc. etc.).

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T14:16:57Z


Ice effects in nwis data are a similar sort of problem (generally speaking, that is - by all means we should seek to understand the cause of this specific use case and what type and cause of "badness" we are dealing with), but nwis provide a nice flag for it and that is another thing we should respect (and has a separate ticket, iirc).

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-04-19T14:25:41Z


Data is readable, to the best of my knowledge. In the News item, I'm not going to use the word "corrupt".

I'm about to send the News item out and need a final review. Please see your emails.

Thanks,

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T14:55:31Z


Hank wrote:

Data is readable, to the best of my knowledge. In the News item, I'm not going to use the word "corrupt".

Good call, yes. If there is an easy way to clarify that it is the data values within variables rather than the metadata or form of the data, that might help, but I have no quick suggestion on how to do that.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T14:56:30Z


I would also not phrase it as something that the WRES team had any part of or has to do anything about, but that's just me, apparently.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-04-19T14:59:34Z


Jesse:

If my message implies the WRES team has anything to do with the issue, please propose a correction. I don't want to imply that, and perhaps I'm just not reading it the way our typical readers will.

All:

I put this in the email, but will put it here, as well:

I think there is still some discussion to be had about mitigating this problem. Essentially, Alex, NW, and MA's evaluations will be invalid until their evaluation period no longer includes the period of bad data. The easiest work-around would likely be for WRDS to remove the data from the service, but that is an "incorrect" solution. WRDS should be a dumb data service, so they should serve what they see. However, any solution on the WRES side (being discussed in the aforementioned ticket) will take a significantly longer period of time to implement, and there appears to be no short-term, "easy" work-around for us to provide to our users. So I'm tempted to ask WRDS to do something on their end even if the solution sucks.

Correct me if I'm wrong, but any solution from us will include (my estimate) days to get into the COWRES and weeks to get into the WRES GUI. It will then require users to make declaration changes which will presumably be undone once the bad data is beyond their evaluation period. So, yes, it would be best to implement something to solve this in the WRES, but that solution will take time and be painful for our users. Again, if I'm wrong, please let me know.

Hence, my leaning toward asking WRDS to do something for us.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-04-19T15:06:43Z


Jesse:

Please look at my most recent email. Does that better convey that it's not our fault?

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T15:08:05Z


It was this part that implies the WRES team has some responsibility or needs to take action beyond this notification:

The WRES team will work to identify a work-around or mitigation for this issue as soon as possible.

So I guess I would not include that. But I agree that if we can regularly expect bad data and that somehow we can do something to flag it or work around it, yeah, we can work on that, but I don't agree that we "will" do it nor that we will do it "as soon as possible." It will take some time as you said. I think it's good to notify as you are already doing.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T15:08:45Z


I am leaning strongly away from asking WRDS to do something. We cannot, on the one hand, hammer the idea of dumb data services and on the other hand ask them to intervene to fix data QC issues. Let's not create that precedent. The publisher isn't going to fix it, apparently. They probably should, but whatever. We cannot ditch this publisher. In general, wres should provide avenues for a user to apply filters to time-series data, whether they stem from data quality control or any other motivation. It should not provide quality control options in itself because it is not a data QC tool. The filters should be sufficiently flexible. Presently, the filters are not sufficiently flexible (e.g., this ticket, #51030). Likewise, the pools (#86646 ).

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-04-19T15:12:47Z


Jesse:

I'll strike that sentence. Don't want to give wrong impressions and the timing of a "mitigation or work-around" is unknown.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T15:14:45Z


Step 1: agree that something should be done. Step 2: agree the thing that should be done.

So, yeah, since we evidently cannot agree among ourselves, I don't see us resolving this anytime soon. Perhaps it will be resolved when a different motivation (unrelated to data QC) requires a similar solution. edit: that said, if Russ et al. require something to be done about it, then I guess we will need to reach agreement sooner, perhaps about the thing that would solve it indirectly and has a separate motivation.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-04-19T15:16:51Z


Anyone object to me asking WRDS to provide a capability to exclude data in a time range?

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-04-19T15:17:39Z


That would allow us to tell our users, "include this URL parameter to avoid using the bad data from a known time period in your evaluation." Something like that.

Hank

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T15:18:55Z


I think WRDS already provides an option to select data within a time interval and WRDS is a web service that is designed to be called as many times as needed, so I would push back very strongly if I were them.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T15:23:19Z


James wrote:

Step 1: agree that something should be done. Step 2: agree the thing that should be done.

So, yeah, since we evidently cannot agree among ourselves, I don't see us resolving this anytime soon. Perhaps it will be resolved when a different motivation (unrelated to data QC) requires a similar solution. edit: that said, if Russ et al. require something to be done about it, then I guess we will need to reach agreement sooner, perhaps about the thing that would solve it indirectly and has a separate motivation.

I agree that something should be done: a notification to users that you can't trust the data for the range in question. And that is a simple and effective thing to do. As for what further needs to be done I am open to that. My objection is to "some data provider had an issue and so therefore it is now top priority for WRES to change their software to work around it." I am also open to it becoming top priority if it needs to be but my point is that it is not an obvious top priority. The notification is important and there might be some other stuff we could do to help but it will likely be laborious for either us or our customers or both.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T15:27:57Z


Jesse wrote:

I agree that something should be done: a notification to users that you can't trust the data for the range in question. And that is a simple and effective thing to do. As for what further needs to be done I am open to that. My objection is to "some data provider had an issue and so therefore it is now top priority for WRES to change their software to work around it." I am also open to it becoming top priority if it needs to be but my point is that it is not an obvious top priority. The notification is important and there might be some other stuff we could do to help but it will likely be laborious for either us or our customers or both.

I think you understood that step to be about a change to wres software.

Priority is a separate issue. The ticket is currently neutral about priority. In that regard, you and I both largely respond to users and others in setting priorities; we can have opinions, of course. I wouldn't rank this as a top priority, personally.

In this case, a notification to users is simple, but has limited effectiveness. There is currently no (edit: effective) way to mitigate the problem, only for our users to create awareness among their users.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T15:29:35Z


James wrote:

I am leaning strongly away from asking WRDS to do something. We cannot, on the one hand, hammer the idea of dumb data services and on the other hand ask them to intervene to fix data QC issues. Let's not create that precedent. The publisher isn't going to fix it, apparently. They probably should, but whatever. We cannot ditch this publisher. In general, wres should provide avenues for a user to apply filters to time-series data, whether they stem from data quality control or any other motivation. It should not provide quality control options in itself because it is not a data QC tool. The filters should be sufficiently flexible. Presently, the filters are not sufficiently flexible (e.g., this ticket, #51030). Likewise, the pools (#86646 ).

DS stands for Data Service. WRDS' bread and butter is to provide data. I think the "dumb" applies to the "service" aspect, the interfaces, and so forth, not necessarily to the "data" part. I agree that it is the publisher that should fix the issue and that mitigations can be put in place downstream of the publisher as well. It is probably more important for immediate downstream publishers to mitigate than remotely downstream consumers. At the same time I am not sure what WRDS would be able to do other than delete those data and I also hesitate to ask them to run any delete commands.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T15:33:26Z


Hank wrote:

Anyone object to me asking WRDS to provide a capability to exclude data in a time range?

It's not a bad idea but I would want to more fully understand the problem before proposing solutions. It sounds like you and James have a very firm grasp on how this situation can happen but I am still surprised and scratching my head.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T15:33:34Z


Jesse wrote:

DS stands for Data Service. WRDS' bread and butter is to provide data. I think the "dumb" applies to the "service" aspect, the interfaces, and so forth, not necessarily to the "data" part. I agree that it is the publisher that should fix the issue and that mitigations can be put in place downstream of the publisher as well. It is probably more important for immediate downstream publishers to mitigate than remotely downstream consumers. At the same time I am not sure what WRDS would be able to do other than delete those data and I also hesitate to ask them to run any delete commands.

No, it surely applies to the data part too in my view, not just the service aspects. For the service to be dumb, it cannot be opinionated about the data it serves, it simply passes through all data without opinion other than the opinion injected by the user request. There is a difference between WRDS the service and WRDS the people, but I don't think WRDS the people should be opinionated about the data either, I don't think they own or curate the data, they serve it. Regardless, yes, the publisher should fix and has refused to fix.

epag commented 3 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-04-19T15:38:30Z


Jesse wrote:

Hank wrote:

Anyone object to me asking WRDS to provide a capability to exclude data in a time range?

It's not a bad idea but I would want to more fully understand the problem before proposing solutions. It sounds like you and James have a very firm grasp on how this situation can happen but I am still surprised and scratching my head.

I object to making WRDS more complex by adding multiple intervals. There is a straightforward approach (not even a workaround) for anyone using WRDS, and that is to request precisely the time-series data required using as many separate calls as required. The situation is not analogous to WRES, which is not a data service.

I don't believe I indicated that I have a firm grasp of what happened in this precise case; I believe I stated the exact opposite: that we should, by all means, seek to understand it insofar as it helps us form an opinion about what wres might do (I see the link as being limited). But it does not sound like the issue is one of corruption of a blob of formatted data (e.g., corruption that occurred during blob transfer), rather a corruption of the process that wrote the data in the first place.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T15:38:38Z


I don't know how much more time we want to waste talking about this but I need to take a break.

epag commented 3 weeks ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2022-04-19T15:44:43Z


I don't really object to the capability of excluding some date ranges optionally, btw. It was the sudden emergency that had me worried, and if an emergency capability is needed, carry on.