DOI-USGS / national-flow-observations

This repository pulls national flow data from NWIS

Build pipeline! #7

Closed lindsayplatt closed 4 years ago

lindsayplatt commented 4 years ago

Adjusted the pipeline where necessary and then built it. Also added 30_data_summarize to capture some of the summarization that is used in gages-through-ages but is also useful elsewhere.

This should fix #3. Below are some diagnostics about the pipeline artifacts from this particular build (week of June 1, 2020).

library(scipiper)
library(dplyr)

Number of DV observations and unique sites pulled

nwis_dv <- readRDS(sc_retrieve('10_nwis_pull/out/nwis_dv_data.rds.ind')) 

nrow(nwis_dv)
[1] 232430029

length(unique(nwis_dv$site_no))
[1] 24311

Number of UV observations and unique sites pulled

nwis_uv <- readRDS(sc_retrieve('10_nwis_pull/out/nwis_uv_data.rds.ind'))

nrow(nwis_uv)
[1] 109534330

length(unique(nwis_uv$site_no))
[1] 575

Number of combined daily flow observations, range of data, and breakdown of daily flow through time

nwis_all <- readRDS(sc_retrieve('20_data_munge/out/daily_flow.rds.ind'))

nrow(nwis_all)
[1] 233383058

range(nwis_all$date)
[1] "1857-02-01" "2020-06-05"

rm(nwis_dv, nwis_uv) # clear the other large objects first because the summary below needs more memory
nwis_all_summary <- nwis_all %>% 
  mutate(year = as.numeric(format(date, "%Y"))) %>% 
  group_by(year) %>% 
  summarize(n_obs = n())
plot(nwis_all_summary$year, nwis_all_summary$n_obs)

[Plot: n_obs by year] Note: the low point at the end is because we are only midway through 2020.

Number of active (>335 days) sites per year & year range of active data

nwis_active_gages <- readRDS(sc_retrieve('30_data_summarize/out/active_flow_gages.rds.ind'))

range(as.numeric(nwis_active_gages$year))
[1] 1863 2020

plot(as.numeric(nwis_active_gages$year), nwis_active_gages$n_gages_per_year)

[Plot: n_gages_per_year by year] Note: the low point at the end is because we are only midway through 2020.
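For readers who want to see how a per-year active gage count like the one above could be derived from daily flow data, here is a minimal sketch on toy data. It is not the code in 30_data_summarize, which is not shown here. It assumes the daily flow table has `site_no` and `date` columns and that "active" means a site has more than 335 days with an observation in a calendar year. The `toy_flow` data, `min_days`, and `active_site_years` names are hypothetical.

```r
library(dplyr)

# Toy stand-in for the daily flow table (hypothetical data).
# Assumes one row per site per day, with columns site_no and date.
toy_flow <- bind_rows(
  # Site A: a full daily record for 2018 and 2019
  tibble(site_no = "A",
         date = seq(as.Date("2018-01-01"), as.Date("2019-12-31"), by = "day")),
  # Site B: a full daily record for 2018 but only 100 days in 2019
  tibble(site_no = "B",
         date = seq(as.Date("2018-01-01"), as.Date("2019-04-10"), by = "day"))
)

# Assumed threshold, taken from the ">335 days" wording above
min_days <- 335

# Count the days of data per site per year, then keep the active site-years
active_site_years <- toy_flow %>%
  mutate(year = as.numeric(format(date, "%Y"))) %>%
  group_by(site_no, year) %>%
  summarize(n_days = n_distinct(date), .groups = "drop") %>%
  filter(n_days > min_days)

# Count the active gages in each year
active_gages_per_year <- active_site_years %>%
  group_by(year) %>%
  summarize(n_gages_per_year = n())

print(active_gages_per_year)
```

In this toy example both sites count as active in 2018, while only site A counts in 2019 because site B has just 100 days of data that year. The same effect explains a dip in a partial final year.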

Number of unique active sites through all time, plus the number of continuous active sites and when the continuous sites were built (earliest year of continuous site data)

nwis_active_gages_info <- readRDS(sc_retrieve('30_data_summarize/out/active_flow_gages_summary.rds.ind'))

nrow(nwis_active_gages_info)
[1] 22365

continuous_gages <- nwis_active_gages_info %>% filter(!any_gaps)
nrow(continuous_gages)
[1] 16841

continuous_gages_start_year <- continuous_gages %>% 
  group_by(earliest_active_year) %>% 
  summarize(n_gages = n())
plot(continuous_gages_start_year$earliest_active_year, continuous_gages_start_year$n_gages)

[Plot: n_gages by earliest_active_year]
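As an illustration of how columns like `any_gaps` and `earliest_active_year` might be produced, the sketch below summarizes a hypothetical table of active site-years. The definition of a gap is an assumption made for this example: a site is treated as having a gap when at least one year between its first and last active year is missing. The pipeline's actual rule may differ. The input data and the `latest_active_year` and `n_active_years` columns are hypothetical.

```r
library(dplyr)

# Hypothetical input: one row per site per active year
active_site_years <- tibble(
  site_no = c("A", "A", "A", "B", "B", "C"),
  year    = c(2001, 2002, 2003, 2001, 2003, 2010)
)

gage_summary <- active_site_years %>%
  group_by(site_no) %>%
  summarize(
    earliest_active_year = min(year),
    latest_active_year   = max(year),         # hypothetical helper column
    n_active_years       = n_distinct(year),  # hypothetical helper column
    .groups = "drop"
  ) %>%
  # Assumed definition: a gap exists when the number of active years is
  # smaller than the span from the first to the last active year
  mutate(any_gaps = n_active_years < (latest_active_year - earliest_active_year + 1))

# Same filter as used on the real summary above
continuous_gages <- gage_summary %>% filter(!any_gaps)

print(continuous_gages)
```

Here site B is dropped because it is missing 2002, while sites A and C are kept. Under this assumed definition a site with a single active year, like C, counts as continuous.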

lindsayplatt commented 4 years ago

@limnoliver added the diagnostics as a yml file and a plot image.