QAQC flow data - Githubissues

For now, evaluating the results of the temporary fix (#46, #42).

With the PR I'm about to submit to convert cfs to cms for the NWIS data, the two data sources for discharge look very similar now. Here are the lead-time-0 predictions from the long-range member 1 (black dots) compared to NWIS data (blue line):

nl1 <- readRDS('2_munge/out/agg_nwis.rds')
al1 <- readRDS('2_munge/out/agg_nwm_long1.rds')
ggplot(filter(nl1$flow, date > as.Date('2017-02-01')), aes(x=date, y=daily_mean)) + geom_line(color='blue') + facet_grid(site_no ~ ., scale='free_y') + geom_point(data=dplyr::filter(al1, ref_date==valid_date), aes(x=valid_date, y=flow, group=ref_date), size=0.5) + ylab(expression(Discharge~(m^3~s^-1))) + xlab(expression(Date)) + theme_classic()

I've confirmed that members 2-4 are very similar.

And for fun, here's a comparison of NWIS obs (blue line again) to medium-range predictions at lead time = 10 (black dots):

nl <- readRDS('2_munge/out/agg_nwis.rds')
al <- readRDS('2_munge/out/agg_nwm_med.rds')
ggplot(filter(nl$flow, date > as.Date('2017-02-01')), aes(x=date, y=daily_mean)) + geom_line(color='blue') + facet_grid(site_no ~ ., scale='free_y') + geom_point(data=dplyr::filter(al, ref_date==as.Date(valid_date-as.difftime(10, units='days'))), aes(x=valid_date, y=flow, group=ref_date), size=0.5) + ylab(expression(Discharge~(m^3~s^-1))) + xlab(expression(Date)) + theme_classic()

Also believable, and give a sense of how much the lead time affects predictions (quite noticeably, but not so much that 10-day lead times are worthless).

The one sorta funny thing is that you can have a lead time of 0 through 10 - that's 11 days per forecast - isn't that one too many?

> al %>% group_by(site_no, ref_date) %>% summarize(n_valid=length(valid_date))
# A tibble: 984 x 3
# Groups:   site_no [?]
   site_no  ref_date   n_valid
   <fct>    <date>       <int>
 1 05465500 2017-05-09      11
 2 05465500 2017-05-10      11
 3 05465500 2017-05-11      11
 4 05465500 2017-05-12      11
 5 05465500 2017-05-13      11
 6 05465500 2017-05-14      11
 7 05465500 2017-05-15      11
 8 05465500 2017-05-16      11
 9 05465500 2017-05-17      11
10 05465500 2017-05-18      11
# ... with 974 more rows

I've now (in upcoming PR) reworked the aggregation so that the aggregated retro reports means for 1am to 11:59pm on each date; aggregated medium-range reports means for 3am to 11:59pm; aggregated long-range reports means for 6am to 11:59pm. This gives us a similar approach across datasets (albeit with different temporal resolutions and therefore start times) and produces 10 days of medium-range forecasts (lead time 0 days through 9 days) and 30 days of long-range forecasts (lead time 0 days through 29 days).

("lead time 0 days" means, for a long-range example, the forecasts generated at 12am on a reference date and describing flow at 6am, 12pm, 6pm, 12am=11:59pm on that reference date.)

Here are some plots to affirm that the results still make sense:

# medium range, lag 0
nl <- readRDS('2_munge/out/agg_nwis.rds')
al <- readRDS('2_munge/out/agg_nwm_med.rds')
ggplot(filter(nl$flow, date > as.Date('2017-02-01')), aes(x=date, y=daily_mean)) + geom_line(color='blue') + facet_grid(site_no ~ ., scale='free_y') + geom_point(data=dplyr::filter(al, 
  ref_date==as.Date(valid_date-as.difftime(0, units='days'))), aes(x=valid_date, y=flow, group=ref_date), size=0.5) + ylab(expression(Discharge~(m^3~s^-1))) + xlab(expression(Date)) + theme_classic()

# medium range, lag 9
nl <- readRDS('2_munge/out/agg_nwis.rds')
al <- readRDS('2_munge/out/agg_nwm_med.rds')
ggplot(filter(nl$flow, date > as.Date('2017-02-01')), aes(x=date, y=daily_mean)) + geom_line(color='blue') + facet_grid(site_no ~ ., scale='free_y') + geom_point(data=dplyr::filter(al, 
  ref_date==as.Date(valid_date-as.difftime(9, units='days'))), aes(x=valid_date, y=flow, group=ref_date), size=0.5) + ylab(expression(Discharge~(m^3~s^-1))) + xlab(expression(Date)) + theme_classic()

# long range, lag 0
nl <- readRDS('2_munge/out/agg_nwis.rds')
al <- readRDS('2_munge/out/agg_nwm_long1.rds')
ggplot(filter(nl$flow, date > as.Date('2017-02-01')), aes(x=date, y=daily_mean)) + geom_line(color='blue') + facet_grid(site_no ~ ., scale='free_y') + geom_point(data=dplyr::filter(al, 
  ref_date==as.Date(valid_date-as.difftime(0, units='days'))), aes(x=valid_date, y=flow, group=ref_date), size=0.5) + ylab(expression(Discharge~(m^3~s^-1))) + xlab(expression(Date)) + theme_classic()

# long range, lag 29
nl <- readRDS('2_munge/out/agg_nwis.rds')
al <- readRDS('2_munge/out/agg_nwm_long1.rds')
ggplot(filter(nl$flow, date > as.Date('2017-02-01')), aes(x=date, y=daily_mean)) + geom_line(color='blue') + facet_grid(site_no ~ ., scale='free_y') + geom_point(data=dplyr::filter(al, 
  ref_date==as.Date(valid_date-as.difftime(29, units='days'))), aes(x=valid_date, y=flow, group=ref_date), size=0.5) + ylab(expression(Discharge~(m^3~s^-1))) + xlab(expression(Date)) + theme_classic()

USGS-R / protoloads

QAQC flow data #47