Processing of "estimated" flow data and handling resulting timeseries gaps

slevin75 commented 1 year ago

Discussed in https://github.com/USGS-R/regional-hydrologic-forcings-ml/discussions/4

Originally posted by **jds485** November 9, 2021

Ken Eng commented that the "e" (estimated) streamflow code means that the flows are filled in from neighboring streamflow records. These appear as "e" somewhere in the "_cd" field of NWIS downloads, for example "Ae" or "A e".

For streamflow, these estimated data are largely used to make visually appealing charts without temporal gaps in them, and should be cautiously employed for modeling applications. As the length of filled-in record increases for a site, it will become increasingly correlated with its neighbors and, as a result, model performance metrics will be artificially improved if that correlation is not considered. Ken mentions examples for which an artificial 95% correlation with neighboring sites occurred, and another example for a physical model that did not perform well during the gap filled in by estimated flows (i.e., they were physically unrealistic for that site).

Ken suggests computing the ratio of "e" to non-"e" days in the record to inform using the record or not. I'm thinking annual computation of this ratio would help. Then the follow-up question is: What do we do with the resulting gaps in data? Ideas:

- We can fill in "e" days for sites with small ratios and short gaps (several days long).
- We can drop sites with large ratios or long gaps (weeks long).

Any further thoughts on using "e" data for predictive modeling?
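As a rough illustration of the annual "e" to non-"e" ratio, here is a minimal Python sketch. It is not project code. It assumes the daily record is available as `(date, qualifier_code)` pairs, that a lowercase "e" anywhere in the "_cd" string marks an estimated day, and that "annual" means the water year (October 1 through September 30). The function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def is_estimated(code: str) -> bool:
    # Assumption: a lowercase "e" anywhere in the "_cd" value (e.g. "Ae", "A e")
    # marks an estimated day.
    return "e" in code


def water_year(d: date) -> int:
    # Water year N runs from Oct 1 of year N-1 through Sep 30 of year N.
    return d.year + 1 if d.month >= 10 else d.year


def estimated_ratio_by_year(records):
    """Return {water_year: estimated_days / non_estimated_days}.

    `records` is an iterable of (date, qualifier_code) pairs.
    A year with no non-estimated days gets a ratio of infinity.
    """
    est = defaultdict(int)
    non = defaultdict(int)
    for d, code in records:
        wy = water_year(d)
        if is_estimated(code):
            est[wy] += 1
        else:
            non[wy] += 1
    ratios = {}
    for wy in sorted(set(est) | set(non)):
        ratios[wy] = est[wy] / non[wy] if non[wy] else float("inf")
    return ratios
```

A per-site threshold on these ratios (the value is a judgment call) could then flag years or whole sites for exclusion.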

slevin75 commented 1 year ago

slevin75 on Nov 10, 2021 (Maintainer)

So, just looking at the 5 sites that I used in the pipeline, it looks like pretty much all of the estimated flows are winter flows and they can be in long chunks of time - like all of January and February, just like the sites with a lot of no data. I have a feeling that depending on the person responsible for the site, they either just leave it as no data or they fill it in. I asked one of our data guys about this and this is the response I got:

"I'd say there are 2 primary causes for estimating data: 1 - There isn't any data or it's bad, 2-Ice is affecting the rating. It's USGS policy that we produce a mean daily flow for all our streamgage sites so if our data doesn't produce one we estimate one. Estimations can be done in several ways. Typically, we say if there is a data gap less than a few hours we can do a direct calculation of the mean daily flow but if the data gap is longer, we manually decide if we can just interpolate thru the gaps. For other times (like ice when we have data but the backwater amount is unknown), we use the hydrographs from nearby sites to make the estimates. Cutting ice is kind of an art."

So, maybe we treat these estimated flow days as data gaps, the same as if there were no data for that day. I very much suspect this will primarily be a factor in the winter low flow months and that if it is very widespread, we will have to either fill in the data (easy but maybe not ideal) or dig into the stats functions and see what is going on and if there is another way we could modify those functions to be able to handle this.
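A minimal Python sketch of treating estimated days as gaps might look like the following. It assumes a contiguous daily series held as parallel lists of flows and "_cd" strings, with `None` standing for a missing day. The helper names are hypothetical and this is not the pipeline's code.

```python
def mask_estimated(flows, codes):
    """Replace flows flagged "e" with None so they count as missing days.

    Assumption: a lowercase "e" anywhere in the qualifier string marks an
    estimated value. Days that are already None stay None.
    """
    return [None if "e" in code else q for q, code in zip(flows, codes)]


def gap_lengths(values):
    """Return the lengths of consecutive runs of None, in order of appearance."""
    runs = []
    current = 0
    for v in values:
        if v is None:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs
```

The run lengths give a quick way to separate short gaps (several days) from long ones (weeks), which is the distinction raised in the original post.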

I haven't gotten a chance to dig into the EflowStats functions yet but one thing we could try is to alter the functions so that they only look at flows from like March through September or something like that. In other words - our 'water year' would only go from spring to fall and would just skip those winter months when we know there are typically no flood events anyway. This might not work for all flow metrics but I think it would work for things like pulse counts and pulse durations if they are computed the way I think they are. There are probably no high pulses during the winter months - we could test this out on some sites that aren't missing data and see how they compare.
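To make the full-year vs. March through September comparison concrete, here is a small Python sketch. The pulse definition (a run of consecutive days above a threshold) is a simplifying assumption and may differ from how EflowStats computes it. The function names and default months are hypothetical.

```python
from datetime import date


def high_pulse_count(flows, threshold):
    """Count runs of consecutive days with flow above `threshold`.

    Missing days (None) end a pulse. This is a simplified definition,
    not necessarily the one used by EflowStats.
    """
    count = 0
    in_pulse = False
    for q in flows:
        above = q is not None and q > threshold
        if above and not in_pulse:
            count += 1
        in_pulse = above
    return count


def season_subset(dates, flows, first_month=3, last_month=9):
    """Keep only flows whose date falls within first_month..last_month."""
    return [q for d, q in zip(dates, flows) if first_month <= d.month <= last_month]
```

Running `high_pulse_count` on a complete year and on its `season_subset` for sites without missing data shows how many pulses the winter months contribute. The subset should be taken one year at a time, since concatenating seasons across years could merge a late-September pulse with an early-March one.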

jds485 on Nov 10, 2021 (Maintainer, Author)

Thanks for asking one of your data analysts - very insightful response! I'm in favor of treating estimated flows as data gaps - because they actually are data gaps!

We can make plots of frequency of "e" vs. time of year to see the impact on winter and non-winter months.
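The numbers behind such a plot could be tabulated with a sketch like this one, in Python. It assumes `(date, qualifier_code)` pairs and a lowercase "e" as the estimated flag. The function name is hypothetical.

```python
from collections import Counter
from datetime import date


def estimated_fraction_by_month(records):
    """Return {month: fraction of days flagged "e"} for months present in the record.

    `records` is an iterable of (date, qualifier_code) pairs.
    Assumption: a lowercase "e" anywhere in the code marks an estimated day.
    """
    total = Counter()
    estimated = Counter()
    for d, code in records:
        total[d.month] += 1
        if "e" in code:
            estimated[d.month] += 1
    return {month: estimated[month] / total[month] for month in sorted(total)}
```

Plotting these fractions against month, per site or pooled across sites, would show whether the estimated days really are concentrated in winter.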

As you suggest, instead of filling in long timeseries gaps, I'd prefer modifying the metric functions or evaluating some metrics on portions of the water year. I like the idea of testing the method on sites that have full records and comparing metric values obtained for the full record vs. subset.

After others have a chance to comment, I can see 2 issues we can create and work on:

- Data analysis for frequency of "e" and other related metrics
- Analysis of metrics on full year vs. subset of water year

jsadler2 on Nov 16, 2021

> I'm in favor of treating estimated flows as data gaps - because they actually are data gaps!

Makes sense!

USGS-R / regional-hydrologic-forcings-ml

Processing of "estimated" flow data and handling resulting timeseries gaps #180
