Develop benchmark queries

lewfish commented 2 years ago

The following is information sent to us by Fernando about typical queries that NOAA runs on NWM. We would like to develop a set of benchmark queries based on this information.

Overall, the team runs three kinds of queries, which are probably very familiar to a team that deals with spatio-temporal data:

Spatial: these queries are for data that intersect or are contained within a given spatial domain. While spatial domains vary greatly, common definitions include political boundaries (states, counties, etc), hydrologic boundaries (HUCs), forecasting boundaries (River forecast center jurisdictions), or any user specified one. Additionally, a spatial query may have to do with stream network topology. Users frequently want to know data samples that correspond to a given point such as a USGS gage but also would like to know data points upstream of said point or within a given buffer lying on the topology of the river network.
Temporal: these queries are for certain date/time points or ranges. To your question, these are very common. Because the data have a high temporal resolution (1 hr), users are frequently interested in temporal aggregations (daily max, weekly mean, monthly median, etc). Temporal queries are the backbone of creating hydrographs, which are streamflow vs time relationships at a given location.
Meta-data: these queries call for values that match given meta-data criteria. The meta-data can be included in NWM outputs (think feature ID, or the specific horizon or type of forecast, such as medium range or reanalysis). Other queries involve columns that might not be included in the NWM outputs, the most common being USGS gage, NHDPlusHR identifier, or HUC membership. While these values are related to spatial queries, they can be indexed to avoid repeating spatial queries. Some ability to quickly join/merge with external datasets on given variables would be very helpful and well used.

Obviously, most queries to NWM data combine these three categories in unique ways.
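
As a rough illustration of how the three categories combine, here is a minimal pandas sketch. The crosswalk table, the column names (`feature_id`, `huc8`, `usgs_gage`, `streamflow`), the IDs, and the `query` helper are all hypothetical and only stand in for whatever storage format we end up benchmarking.

```python
import pandas as pd

# Hypothetical crosswalk from NWM feature ID to attributes that are not in
# the NWM output itself (HUC8 membership, USGS gage).
crosswalk = pd.DataFrame({
    "feature_id": [101, 102, 103],
    "huc8": ["02040202", "02040202", "02040203"],
    "usgs_gage": ["01474500", None, None],
})

# Hypothetical hourly streamflow samples in long format (made-up values).
times = pd.date_range("2020-01-01", periods=48, freq="60min")
index = pd.MultiIndex.from_product(
    [[101, 102, 103], times], names=["feature_id", "time"]
)
flows = pd.DataFrame(
    {"streamflow": [float(i) for i in range(len(index))]}, index=index
).reset_index()


def query(flows, crosswalk, huc8s, start, end):
    """Select samples in [start, end) for features that belong to the given HUC8s."""
    # Meta-data filter: a precomputed index replaces a repeated spatial query.
    ids = crosswalk.loc[crosswalk["huc8"].isin(huc8s), ["feature_id", "huc8"]]
    # Temporal filter on the half-open range [start, end).
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    in_range = flows[(flows["time"] >= start) & (flows["time"] < end)]
    # Join the samples with the external attributes.
    return in_range.merge(ids, on="feature_id", how="inner")


result = query(flows, crosswalk, ["02040202"], "2020-01-01", "2020-01-02")
```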

More concretely, here are some typical queries:

All stream flows within a given set of HUC8's and a date/time range. Samples are then aggregated to daily averages/max/min.
All hourly stream flows at a given stream location that corresponds to a USGS gage for a given date/time range. Also include all samples 5km upstream of the USGS gage in all possible directions. A 99th percentile is then calculated for the given data at each ID.
Also, we work with different target variables other than stream flow. Often, mean areal precipitation is calculated within a given date/time range upstream of a given location or corresponding to a given HUC.
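
The gage query with a 5km upstream buffer could be sketched roughly as below. This assumes the river network is a tree stored as a mapping from each reach to its immediately upstream reaches, and that a reach counts as "within 5km" when the cumulative length of the reaches walked to reach its upstream end is at most 5 km. The `UPSTREAM` and `LENGTH_KM` tables, the reach IDs, and the flow values are made up for illustration.

```python
from collections import deque

import numpy as np
import pandas as pd

# Hypothetical topology: reach ID -> IDs of the reaches immediately upstream.
UPSTREAM = {1: [2, 3], 2: [4], 3: [], 4: [5], 5: []}
# Hypothetical reach lengths in km.
LENGTH_KM = {1: 1.0, 2: 2.0, 3: 4.5, 4: 2.5, 5: 3.0}


def upstream_within(start, max_km):
    """Breadth-first walk upstream from `start`, following every tributary,
    keeping reaches whose cumulative along-network length is <= max_km."""
    dist = {start: 0.0}
    queue = deque([start])
    while queue:
        reach = queue.popleft()
        for parent in UPSTREAM.get(reach, []):
            d = dist[reach] + LENGTH_KM[parent]
            if d <= max_km and parent not in dist:
                dist[parent] = d
                queue.append(parent)
    return set(dist)


# Reach 1 plays the role of the reach that carries the USGS gage.
ids = upstream_within(1, 5.0)

# Made-up samples: 101 values per reach, scaled by the reach ID.
reach_ids = np.repeat([1, 2, 3, 4, 5], 101)
flows = pd.DataFrame({
    "feature_id": reach_ids,
    "streamflow": np.tile(np.arange(101.0), 5) * reach_ids,
})

# 99th percentile of the selected samples, computed separately for each ID.
selected = flows[flows["feature_id"].isin(sorted(ids))]
p99 = selected.groupby("feature_id")["streamflow"].quantile(0.99)
```

A real network may not be a strict tree (braided channels, diversions), in which case a shortest-path search would be needed instead of a plain breadth-first walk.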

lewfish commented 2 years ago

We will initially focus on the following query using the reanalysis dataset, since it is already stored in a cloud-friendly format and we can just focus on re-formatting it in different ways.

"All stream flows within a given set of HUC8's and a date/time range. Samples are then aggregated to daily averages/max/min."

It's not clear if we should be running these aggregations for each stream individually, or across all streams. It's also not clear if "daily averages" should be averaged across all days in the dataset, or we should be computing an average for each individual day. I would also like to know typical values for the number of HUC8s, and the length of date/time range.
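
To make the ambiguity concrete, here is a small pandas sketch of the readings in question. The column names and values are hypothetical, and nothing here says which reading NOAA intends.

```python
import pandas as pd

# Two hypothetical streams with 48 hourly samples each (made-up values).
times = pd.date_range("2020-01-01", periods=48, freq="60min")
flows = pd.DataFrame({
    "time": list(times) * 2,
    "feature_id": [101] * 48 + [102] * 48,
    "streamflow": [float(i) for i in range(48)]
    + [float(100 + i) for i in range(48)],
})
day = flows["time"].dt.floor("D")

# Reading 1: one value per stream per day.
per_stream_per_day = flows.groupby(["feature_id", day])["streamflow"].agg(
    ["mean", "max", "min"]
)

# Reading 2: one value per day, pooled across all selected streams.
across_streams_per_day = flows.groupby(day)["streamflow"].agg(
    ["mean", "max", "min"]
)

# Reading 3: daily means averaged again over all days, per stream.
mean_of_daily_means = per_stream_per_day["mean"].groupby(level="feature_id").mean()
```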

lewfish commented 2 years ago

Some clarification from Fernando:

"In terms of date/time ranges, I can't give you a specific one. Someone interested in large time-scales may want to do the entire 40 year history. Others may just want a time domain pertaining to a given flood event and those can vary from a few hours to weeks.

Yes, temporal aggregations target different resolutions. A user may want a time domain for a year but then get daily max's or weekly min's (if working with drought). These are common but yes the target resolution could be the same as the time domain (user pulls a year of data and just wants to find the max for the year).

Lastly, I would say the most common aggregations are across a time domain for the given Feature IDs. Occasionally, you will find a need to aggregate for a spatial area across Feature ID's."
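
Based on this, a benchmark query could take the target resolution as a parameter and aggregate per Feature ID. The following is only a sketch under that assumption: the `aggregate` helper, its parameters, and the column names are hypothetical, and `freq=None` is used here to mean "aggregate over the whole time domain".

```python
import pandas as pd


def aggregate(flows, how, freq=None):
    """Aggregate streamflow per feature ID.

    freq is a pandas offset alias such as "D" or "W"; None collapses the
    whole time domain into a single value per feature ID.
    """
    keys = ["feature_id"]
    if freq is not None:
        keys.append(pd.Grouper(key="time", freq=freq))
    return flows.groupby(keys)["streamflow"].agg(how)


# Two hypothetical features with 14 days of hourly samples (made-up values).
times = pd.date_range("2020-01-06", periods=14 * 24, freq="60min")
flows = pd.DataFrame({
    "time": list(times) * 2,
    "feature_id": [101] * len(times) + [102] * len(times),
    "streamflow": [float(i) for i in range(len(times))]
    + [float(1000 + i) for i in range(len(times))],
})

daily_max = aggregate(flows, "max", freq="D")   # daily max per feature ID
weekly_min = aggregate(flows, "min", freq="W")  # weekly min, e.g. for drought
domain_max = aggregate(flows, "max")            # one max over the whole domain
```

Aggregating across Feature IDs for a spatial area, the less common case, would drop `feature_id` from the grouping keys or replace it with an area identifier such as a HUC8.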

azavea / noaa-hydro-data

Develop benchmark queries #53