HARPgroup / HARParchive

This repo houses HARP code development items, resources, and intermediate work products.

REST / RomDataSource, metrics and analysis #1354

Open rburghol opened 2 months ago

rburghol commented 2 months ago

Analysis Steps

Overview

Data Model & REST

Data Model Outline

Detail View of REST Feature Query
feature <- RomFeature$new(
  ds,
  list(
    hydrocode=coverage_hydrocode,
    ftype=coverage_ftype,
    bundle=coverage_bundle
  ),
  TRUE
)

Using om_vahydro_metric_grid to retrieve data


# GET VAHydro stormVol_prism precip_annual_max_in DATA for usgs_full_drainage watersheds
mdf <- data.frame(
  'model_version' = c('cbp-6.1'),
  'runid' = c('stormVol_prism'),
  'metric' = c('precip_annual_max_in'),
  'runlabel' = c('precip_annual_max_in')
)
met_data <- om_vahydro_metric_grid(
  metric = metric, runids = mdf, bundle="watershed", ftype="usgs_full_drainage",
  base_url = paste(site,'entity-model-prop-level-export',sep="/"),
  ds = ds
)
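
The call above assumes ds, site, and metric are already defined earlier in the workflow. A rough setup sketch, where the endpoint URL, credential variables, and the library/constructor calls are assumptions based on typical HARP scripts rather than something specified in this issue:

# assumed setup -- names and endpoint are illustrative, not from this issue
library(hydrotools)                          # assumed package providing RomDataSource and om_vahydro_metric_grid
site <- "http://deq1.bse.vt.edu:81/d.dh"     # assumed VAHydro REST endpoint
ds <- RomDataSource$new(site, rest_uname)    # rest_uname / rest_pw assumed to be set in the environment
ds$get_token(rest_pw)
metric <- 'precip_annual_max_in'             # matches the metric column in mdf above
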
nathanielf22 commented 1 month ago

Here is the current function, which I updated a bit tonight. It works for any PRISM, daymet, or NLDAS2 data, such as: read.csv("http://deq1.bse.vt.edu:81/met/stormVol_prism/precip/usgs_ws_01613900-PRISM-all.csv")

When calculating a metric, you can run something like this to return the lowest 90-day average precipitation: summary_analytics(prism)$l90_precip_in

library(tidyr)
library(sqldf)
library(zoo)

# For ANY data
summary_analytics <- function(df){
  df <- separate(data = df, col = obs_date, into = c("obs_date","obs_time"), sep = " ")
  df <- separate(data = df, col = obs_date, into = c("obs_year","obs_month","obs_day"), sep = "-")

  # creating a yearly summary with each year and its total precip
  yearly.summary <- sqldf(
    "SELECT obs_year, SUM(precip_in) AS total_precip
     FROM df
     GROUP BY obs_year"
  )

  # summary analytics
  precip_annual_max_in <- max(yearly.summary$total_precip)
  precip_annual_max_year <- yearly.summary$obs_year[which.max(yearly.summary$total_precip)]
  precip_annual_mean_in <- mean(yearly.summary$total_precip)

  # For min values and years, we exclude the first and last rows since the
  # current year and the first year are incomplete data; obs_year is subset
  # the same way so the index from which.min() lines up with the right year
  precip_annual_min_in <- min(yearly.summary$total_precip[c(-nrow(yearly.summary), -1)])
  precip_annual_min_year <- yearly.summary$obs_year[c(-nrow(yearly.summary), -1)][
    which.min(yearly.summary$total_precip[c(-nrow(yearly.summary), -1)])]

  # Create daily summary to use for all data. This makes hourly data daily sums.
  daily.summary <- sqldf(
    "SELECT obs_year, obs_month, obs_day, SUM(precip_in) AS total_precip
     FROM df
     GROUP BY obs_year, obs_month, obs_day"
  )
  precip_daily_max_in <- max(daily.summary$total_precip)

  # if/else evaluates the number of unique hours; if 24, the hourly max is
  # taken, otherwise hourly max is NA. Alternatively to a missing value for
  # hourly precip in daily data, we could look at the rainfall distribution
  # table for a type II storm and multiply by the max P(t)/P(24) value;
  # this function uses NA for daily data.
  if (length(unique(df$hr)) == 24) {
    precip_hourly_max_in <- max(df$precip_in)
  } else {
    precip_hourly_max_in <- NA
  }

  # l90 using the zoo package (lowest 90-day rolling mean of daily precip)
  l90_precip_in <- min(rollapply(daily.summary$total_precip, width = 90, FUN = mean, fill = NA, align = "right"), na.rm = TRUE)

  # l90 done using the zoo package; the IHA approach did not work:
  # Qout_zoo <- zoo(as.numeric(df$precip_in), order.by = df$tstime)
  # Qout_g2 <- data.frame(group2(Qout_zoo))

  # makes a data frame with all 8 metrics
  metrics <- data.frame(
    precip_annual_max_in = precip_annual_max_in,
    precip_annual_max_year = precip_annual_max_year,
    precip_annual_mean_in = precip_annual_mean_in,
    precip_annual_min_in = precip_annual_min_in,
    precip_annual_min_year = precip_annual_min_year,
    precip_daily_max_in = precip_daily_max_in,
    precip_hourly_max_in = precip_hourly_max_in,
    l90_precip_in = l90_precip_in
  )
  return(metrics)
}
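
A minimal usage sketch, following the read.csv example above (the prism object name is just illustrative):

# load a PRISM precip CSV for a gage watershed, then pull one metric
prism <- read.csv("http://deq1.bse.vt.edu:81/met/stormVol_prism/precip/usgs_ws_01613900-PRISM-all.csv")
summary_analytics(prism)$l90_precip_in   # lowest 90-day average precipitation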

I can run through this in detail next week since I know this is a lot to look at. I'll be out of town tomorrow afternoon through Sunday and likely out of service, but I will do my best to respond before I leave or after I get back this weekend. In the meantime, I wanted to make sure this is stored here.

rburghol commented 1 month ago

@nathanielf22 hey Nate, thanks a bunch for pushing this up before you headed out. Very much look forward to working with this, and congratulations on the progress.

nathanielf22 commented 1 month ago

10/25 meeting:

rburghol commented 1 month ago

@nathanielf22 if you can post up the input variables that created that strange output, I will check on my machine. But, from my recollection, I think that we used the wrong coverage_ftype, and in my testing with another coverage (details below), I got correct results. The ftype should be usgs_full_drainage for those USGS gage watershed features.


scenario_name <- 'daymet2'
coverage_hydrocode <- 'usgs_ws_01673550'
coverage_bundle <- 'watershed'
coverage_ftype <- 'usgs_full_drainage'
model_version <- 'cbp-6.1'
met_file <- 'http://deq1.bse.vt.edu:81/met/daymet/precip/usgs_ws_01673550-daymet-all.csv'
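
These inputs plug into the REST feature query shown in the overview above; a quick sketch, assuming ds is the RomDataSource already configured in the workflow:

# pull the usgs_full_drainage watershed feature for this gage coverage
feature <- RomFeature$new(
  ds,
  list(
    hydrocode=coverage_hydrocode,
    ftype=coverage_ftype,
    bundle=coverage_bundle
  ),
  TRUE
)
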
rburghol commented 1 month ago

@nathanielf22 -- I have one observation about the script as is. I tested it in full, and it works! The only thing you should update is the variable name summary, which should be changed to something else. The reason is that there is an existing R function called summary(), and when you assign the output of summary_analytics() to summary, that function gets masked; it is a fairly important function that is used a good deal. precip_summary might be a good option :).
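
A two-line sketch of that rename, reusing the prism object from the earlier example:

precip_summary <- summary_analytics(prism)   # avoids masking base R's summary()
precip_summary$l90_precip_in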

nathanielf22 commented 1 week ago

@rburghol @COBrogan -- The code we put together last Friday to integrate into the workflow has an issue with the daymet and NLDAS data and should be updated. When including the n >= 365 days limitation, daymet excludes certain years where it only has 364 days, as shown in the attached screenshots (omitted here).

In addition, NLDAS has an issue because the yearly summary is created before the daily summary, so I switched that order around. This data is for gage usgs_ws_01613900.

COBrogan commented 1 week ago

@nathanielf22 The daymet dataset will always have 365 days in each year. They always exclude December 31st on leap years (but include February 29th, which is an unusual set-up). But remember that daymet is based on the GMT time zone. That means that each day in EST runs from 7AM to 7AM the next day. So, when grouping data by year, it matters whether you are grouping by tstime or tsendtime.

In this case, we can trace through our routines and see that we are grouping data by tsendtime (a confusing by-product of how PRISM operates and an assumption we made about daymet based on a daymet-PRISM regression). That means that leap years will always have 366 data points and the year after will have 364. To better demonstrate this, we are essentially beginning our year at 7 AM January 2nd and ending it on 7AM Dec 31. So, since a daymet leap year does not have Dec 31st, it must have 364 days.

My suggestion for now is to include years with >= 364 days, but let's discuss at our next meeting!
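
A minimal sketch of that screen, assuming the daily.summary data frame produced inside summary_analytics() above; the HAVING clause and the dot-free copy for sqldf are my illustration, not an agreed-upon fix:

# keep only years with at least 364 daily records before computing yearly totals
daily_df <- daily.summary  # copy to a dot-free name so sqldf can reference it cleanly
yearly.summary <- sqldf(
  "SELECT obs_year, COUNT(*) AS n_days, SUM(total_precip) AS total_precip
   FROM daily_df
   GROUP BY obs_year
   HAVING COUNT(*) >= 364"
)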

For reference, note that daymet has 365 days each year when grouping by tstime:

select count(ts.rast),extract(year from to_timestamp(ts.tstime))
from dh_timeseries_weather as ts
left join dh_variabledefinition var on ts.varid = var.hydroid
where var.varkey = 'daymet_mod_daily'
group by extract(year from to_timestamp(ts.tstime));

 count | extract
-------+---------
   365 |    1980
   365 |    1981
   365 |    1982
   365 |    1983
   365 |    1984
   365 |    1985
   365 |    1986
   365 |    1987
   365 |    1988
   365 |    1989
   365 |    1990
   365 |    1991