Further investigation reveals that those SD sites did have data pulled down, just not enough to be considered "active" for 2019. My next step will be to re-pull data for the SD sites that didn't meet the "active" threshold.
# Why are we missing SD?
# Did something happen during the data munging?
library(dplyr)
library(dataRetrieval)

# First, define a bbox that is mostly SD but not entirely.
sd_bbox <- c(xmin = -104.471120, ymin = 42.447817, xmax = -96.097503, ymax = 46.192474)
dv_data <- readRDS("10_nwis_pull/out/nwis_dv_data.rds")
dv_sites <- unique(dv_data$site_no)
dv_sites_sd <- readNWISsite(dv_sites) %>%
  filter(dec_long_va > sd_bbox[["xmin"]],
         dec_long_va < sd_bbox[["xmax"]],
         dec_lat_va > sd_bbox[["ymin"]],
         dec_lat_va < sd_bbox[["ymax"]]) %>%
  pull(site_no)
dv_data_sd <- dv_data %>% filter(site_no %in% dv_sites_sd)
dv_data_sd_2019 <- dv_data_sd %>%
  mutate(year = format(Date, "%Y")) %>%
  filter(year == "2019")

uv_data <- readRDS("10_nwis_pull/out/nwis_uv_data.rds")
uv_sites <- unique(uv_data$site_no)
uv_sites_sd <- readNWISsite(uv_sites) %>%
  filter(dec_long_va > sd_bbox[["xmin"]],
         dec_long_va < sd_bbox[["xmax"]],
         dec_lat_va > sd_bbox[["ymin"]],
         dec_lat_va < sd_bbox[["ymax"]]) %>%
  pull(site_no)
uv_data_sd <- uv_data %>% filter(site_no %in% uv_sites_sd)
uv_data_sd_2019 <- uv_data_sd %>%
  mutate(year = format(dateTime, "%Y")) %>%
  filter(year == "2019")
# Now plot the sites that appear in both.
both_sites_sd_2019 <- readNWISsite(unique(c(dv_data_sd_2019$site_no, uv_data_sd_2019$site_no)))
# Where are those sites?
maps::map("state", "south dakota")
points(both_sites_sd_2019$dec_long_va, both_sites_sd_2019$dec_lat_va)
title("Pre-munge 2019 data")
all_data <- readRDS("20_data_munge/out/daily_flow.rds")
data_2019 <- all_data %>%
  mutate(year = format(date, "%Y")) %>%
  filter(year == "2019")
gages_2019_sd <- readNWISsite(unique(data_2019$site_id)) %>%
  filter(dec_long_va > sd_bbox[["xmin"]],
         dec_long_va < sd_bbox[["xmax"]],
         dec_lat_va > sd_bbox[["ymin"]],
         dec_lat_va < sd_bbox[["ymax"]])
# Where are those sites?
maps::map("state", "south dakota")
points(gages_2019_sd$dec_long_va, gages_2019_sd$dec_lat_va)
title("Post-munge 2019 data")
# Still have all of SD gages
# Now, test which are "active" by manually calculating that here
days_of_data_gages_2019_sd <- data_2019 %>%
  filter(site_id %in% unique(gages_2019_sd$site_no)) %>%
  group_by(site_id) %>%
  summarize(count = n())
active_gages_2019_sd <- days_of_data_gages_2019_sd %>%
  filter(count >= 335) %>%
  pull(site_id) %>%
  readNWISsite()
maps::map("state", "south dakota")
points(active_gages_2019_sd$dec_long_va, active_gages_2019_sd$dec_lat_va)
title("Post-munge 2019 data, active gages only")
# After further investigation, it appears that many SD gages are
# dropped due to not having enough data to be counted as "active"
nrow(gages_2019_sd)
# [1] 143
nrow(active_gages_2019_sd)
# [1] 66
days_of_data_gages_2019_sd %>% filter(count < 335) %>% arrange(count) %>% as.data.frame()
# site_id count
# 06479520 22
# 06441000 90
# 06437500 141
# 06354480 214
# 06410000 214
# 05051600 215
# 06401500 222
# 06452320 242
# 06478000 242
# 06408700 271
# 06465700 287
# 06475000 289
# 06449500 292
# 06450500 292
# 06403300 300
# 06447000 300
# 06447230 300
# 06446000 301
# 06446500 301
# 06479215 301
# 06357800 302
# 06464100 302
# 06464500 303
# 06447450 304
# 06442600 305
# 06481480 306
# 06442130 308
# 06452000 308
# 06479438 308
# 06481000 308
# 06479500 309
# 06479525 309
# 06479770 309
# 06480000 309
# 06480650 309
# 06354881 310
# 06441500 310
# 06442900 310
# 06471000 310
# 06471200 310
# 06471500 310
# 06471800 310
# 06481500 310
# 06482000 310
# 06482610 310
# 06478690 311
# 06479010 311
# 06482020 311
# 06485910 311
# 06360500 313
# 06472000 313
# 06473000 314
# 06476000 314
# 06477000 314
# 06478600 314
# 06483950 314
# 06438000 315
# 06485500 315
# 06470878 317
# 05293000 320
# 06425500 320
# 05051300 323
# 06436198 323
# 06437000 323
# 06478500 323
# 05050000 324
# 05291000 324
# 06409000 324
# 06478513 324
# 05290000 325
# 06438500 325
# 06477500 329
# 06410500 330
# 06425100 330
# 06436180 330
# 06404998 334
# 06421500 334
The weird thing is that if I pull that data manually now, many of those gages now count as "active". For example, look at the difference for 05050000 between last week and this week: 324 vs. 365 days. Could it be that some centers are still updating data for 2019?
# Try pulling data for those that aren't active in 2019
not_active_sd <- days_of_data_gages_2019_sd %>%
  filter(count < 335) %>%
  pull(site_id)
repull_dv_data <- readNWISdata(siteNumbers = not_active_sd, service = "dv", parameterCd = "00060",
                               startDate = "2019-01-01", endDate = "2019-12-31")
repull_dv_data %>% group_by(site_no) %>% summarize(count = n()) %>% arrange(count) %>% as.data.frame()
# site_no count
# 06479520 84
# 06441000 90
# 06437500 141
# 06354480 214
# 06410000 214
# 05051600 215
# 06401500 222
# 06408700 271
# 06465700 287
# 06478000 290
# 06452320 303
# 06481000 343
# 06464500 345
# 06477500 345
# 06447000 346
# 06480650 352
# 06438000 353
# 06442600 353
# 06473000 353
# 06438500 355
# 06449500 357
# 06442900 358
# 05293000 359
# 06436198 359
# 06483950 360
# 06479010 361
# 06409000 362
# 06477000 362
# 06479438 362
# 06446500 363
# 06447230 363
# 06479525 363
# 06464100 364
# 06470878 364
# 06479500 364
# 06481500 364
# 06482000 364
# 05050000 365
# 05051300 365
# 05290000 365
# 05291000 365
# 06354881 365
# 06357800 365
# 06360500 365
# 06403300 365
# 06404998 365
# 06410500 365
# 06421500 365
# 06425100 365
# 06425500 365
# 06436180 365
# 06437000 365
# 06441500 365
# 06442130 365
# 06446000 365
# 06447450 365
# 06450500 365
# 06452000 365
# 06471000 365
# 06471200 365
# 06471500 365
# 06471800 365
# 06472000 365
# 06475000 365
# 06476000 365
# 06478500 365
# 06478513 365
# 06478600 365
# 06478690 365
# 06479215 365
# 06479770 365
# 06480000 365
# 06481480 365
# 06482020 365
# 06482610 365
# 06485500 365
# 06485910 365
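To put those "last week vs. this week" differences side by side, a quick join of the original pipeline counts against the re-pulled counts might look like this (a sketch using the objects created above; not part of the pipeline):
# Per-site comparison: days of 2019 data in the original pipeline output vs. the fresh re-pull
repull_counts <- repull_dv_data %>%
  group_by(site_no) %>%
  summarize(count_repull = n())
days_of_data_gages_2019_sd %>%
  rename(site_no = site_id, count_pipeline = count) %>%
  inner_join(repull_counts, by = "site_no") %>%
  mutate(gained = count_repull - count_pipeline) %>%
  arrange(desc(gained))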
Weird -- I suppose it's possible that those records were added in the interim. However, my impression of the workflow is that for these continuous sites, the path to NWIS is more or less automated, with data flagged as preliminary until the records are approved. So they should still show up without someone pressing a button? But that is a big guess on my part.
Maybe one potential output of the pull is a comparison of the number of records we expected (from the inventory) versus what we downloaded. This won't be perfect, because I think the record count that whatNWISdata returns is the number of days of record (or the difference between the min and max day), but it could still identify outliers. Also, we use a multiplier of 96 for uv, assuming 15-minute data, but there is a lot of variability in the temporal frequency of collection.
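A rough sketch of what that inventory-vs-download comparison could look like, using whatNWISdata's count_nu column as the expected days of record (hypothetical helper code, assuming the dv_data and dv_sites objects from the first block above; a large site list would need to be chunked):
# Expected record counts from the NWIS inventory (count_nu ~ days of record for dv)
expected <- whatNWISdata(siteNumber = dv_sites, parameterCd = "00060",
                         service = "dv", statCd = "00003") %>%
  select(site_no, expected_n = count_nu)
# Actual record counts in what we downloaded
actual <- dv_data %>%
  group_by(site_no) %>%
  summarize(actual_n = n())
# Flag sites where the download came back much smaller than the inventory suggests.
# For uv, the analogous expectation would be roughly count_nu * 96 (assuming 15-minute data).
expected %>%
  inner_join(actual, by = "site_no") %>%
  mutate(pct_of_expected = actual_n / expected_n) %>%
  arrange(pct_of_expected)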
And thinking through this now, this still would not have ID'd these sites as outliers, correct? Because the pull did retrieve earlier data, but not 2019 data?
Right, I don't think this would have identified them as outliers, because the pull did retrieve 2019 data, just not as much as we got this second time around.
OK, so I just went through and checked the output from the pull tasks associated with each of the missing gages. First, each of those pull tasks has 26-27 other sites that are randomly distributed around the country. The sites that are missing are in the middle of the site list used for the pull in quite a few of the tasks (I just spot-checked some of these manually).
When I read the output from each pull task and summarize how much data exists for each of our missing sites in 2019, I get counts > 335. So this is making me think it isn't a data pull issue. Either my check this morning about the status of the new data was actually looking at the newly pulled data, or there is another snag somewhere along the way that eliminates this extra data.
library(dplyr)
source("10_nwis_pull/src/nwis_partition.R")
dv_partition <- partition_inventory(inventory = readRDS('10_nwis_pull/inout/nwis_dv_inventory.rds'),
                                    250000, "200612")
missing_sites <- c("05050000", "05051300", "05051600", "05290000", "05291000",
"05293000", "06354480", "06354881", "06357800", "06360500", "06401500",
"06408700", "06409000", "06410000", "06410500", "06421500", "06425100",
"06425500", "06436180", "06436198", "06437000", "06437500", "06438000",
"06438500", "06441000", "06441500", "06442130", "06442600", "06442900",
"06446500", "06447000", "06447230", "06447450", "06449500", "06450500",
"06452000", "06452320", "06464100", "06464500", "06465700", "06470878",
"06471000", "06471200", "06471500", "06471800", "06472000", "06473000",
"06475000", "06476000", "06477000", "06477500", "06478000", "06478500",
"06478513", "06478600", "06478690", "06479010", "06479215", "06479438",
"06479500", "06479520", "06479525", "06479770", "06480000", "06480650",
"06481000", "06481480", "06481500", "06482000", "06482020", "06482610",
"06483950", "06485500", "06485910")
sd_partition <- dv_partition %>%
  filter(site_no %in% missing_sites)
length(unique(sd_partition$PullTask)) # they almost each have their own PullTask
[1] 72
# Get more information about the pull tasks that the missing sites are in
# They all have between 26 and 27 sites in each pull
dv_partition %>%
  filter(PullTask %in% sd_partition$PullTask) %>%
  group_by(PullTask) %>%
  summarize(n = n()) %>%
  summary()
   PullTask               n
 Length:72          Min.   :26.00
 Class :character   1st Qu.:26.00
 Mode  :character   Median :26.00
                    Mean   :26.12
                    3rd Qu.:26.00
                    Max.   :27.00
# Summarize info about how much data came back from each pull for the missing sites:
how_many_days_in_2019 <- function(pt, missing_sites) {
  # Look at the data that was pulled down for that site
  pt_data <- readRDS(sprintf("10_nwis_pull/tmp/dv_%s.rds", pt))
  pt_sites <- unique(pt_data$site_no)
  pt_missing_site <- pt_sites[which(pt_sites %in% missing_sites)]
  # Where are the sites in this one PT? Did this manually for a few PullTasks,
  # but turned it off when I used this function to do every PullTask
  # pt_data_loc <- dataRetrieval::readNWISsite(pt_sites)
  # maps::map("state")
  # points(pt_data_loc$dec_long_va, pt_data_loc$dec_lat_va)
  # Spread around the country
  # Look at how many days of data in 2019 just our missing site has
  pt_data %>%
    mutate(year = format(Date, "%Y")) %>%
    filter(year == "2019", site_no %in% pt_missing_site) %>%
    group_by(site_no) %>%
    summarize(count = n()) %>%
    mutate(PullTask = pt)
}
pt_to_test <- unique(sd_partition$PullTask)
summary_of_data_pulls <- purrr::map(pt_to_test, how_many_days_in_2019, missing_sites) %>%
  purrr::reduce(bind_rows)
# Total number of missing sites
nrow(summary_of_data_pulls)
[1] 74
# Number of missing sites that count as active
summary_of_data_pulls %>% filter(count >= 335) %>% nrow
[1] 63
When you check the pulls, do you check the raw downloaded files or the munged files? I thought we had ruled out the data munge step as a potential source of error?
We ruled out 20_munge, but not the combiner step at the end of 10_nwis_uv_pull_tasks.yml. When I checked the raw files, they had the right amount of data ... so I think it is in the combiner at the end of that loop_tasks:
https://github.com/USGS-R/national-flow-observations/blob/master/10_nwis_pull.yml#L111
OK, definitely the combiner step.
library(dplyr)
missing_sites <- c("05050000", "05051300", "05051600", "05290000", "05291000",
"05293000", "06354480", "06354881", "06357800", "06360500", "06401500",
"06408700", "06409000", "06410000", "06410500", "06421500", "06425100",
"06425500", "06436180", "06436198", "06437000", "06437500", "06438000",
"06438500", "06441000", "06441500", "06442130", "06442600", "06442900",
"06446500", "06447000", "06447230", "06447450", "06449500", "06450500",
"06452000", "06452320", "06464100", "06464500", "06465700", "06470878",
"06471000", "06471200", "06471500", "06471800", "06472000", "06473000",
"06475000", "06476000", "06477000", "06477500", "06478000", "06478500",
"06478513", "06478600", "06478690", "06479010", "06479215", "06479438",
"06479500", "06479520", "06479525", "06479770", "06480000", "06480650",
"06481000", "06481480", "06481500", "06482000", "06482020", "06482610",
"06483950", "06485500", "06485910")
# Identify pull tasks with missing sites
pulltasks_to_combine <- c(
"200612_824", "200612_829", "200612_831", "200612_864", "200612_907",
"200612_794", "200612_791", "200612_755", "200612_637", "200612_654",
"200612_611", "200612_886", "200612_576", "200612_528", "200612_480",
"200612_498", "200612_382", "200612_337", "200612_325", "200612_258",
"200612_236", "200612_190", "200612_168", "200612_403", "200612_496",
"200612_578", "200612_374", "200612_884", "200612_716", "200612_209",
"200612_066", "200612_028", "200612_624", "200612_454", "200612_364",
"200612_507", "200612_393", "200612_217", "200612_797", "200612_760",
"200612_786", "200612_727", "200612_400", "200612_667", "200612_302",
"200612_598", "200612_483", "200612_078", "200612_489", "200612_435",
"200612_215", "200612_842", "200612_646", "200612_313", "200612_431",
"200612_293", "200612_721", "200612_923", "200612_564", "200612_650",
"200612_472", "200612_406", "200612_846", "200612_340", "200612_372",
"200612_232", "200612_909", "200612_737", "200612_381", "200612_342",
"200612_866", "200612_320")
source("10_nwis_pull/src/nwis_combine_functions.R")
combine_things <- function(...) {
  rds_files <- c(...)
  df_list <- list()
  for (i in seq_len(length(rds_files))) {
    flow_dat <- readRDS(rds_files[i])
    reduced_dat <- convert_to_long(flow_dat)
    df_list[[i]] <- reduced_dat
  }
  nwis_df <- do.call("bind_rows", df_list)
  return(nwis_df)
}
files_to_combine <- sprintf("10_nwis_pull/tmp/dv_%s.rds", pulltasks_to_combine)
# Read in the pulled down raw files and get a count of how many days
# of data they each have in 2019.
counts_2019_before <- purrr::map(files_to_combine, function(fn) {
  readRDS(fn) %>%
    mutate(year = as.numeric(format(Date, "%Y"))) %>%
    filter(year == 2019) %>%
    group_by(site_no) %>%
    summarize(count = n())
}) %>% purrr::reduce(bind_rows)
# Now, using a non-scipiper version of the combiner function, read the raw files,
# combine, and get a count of how many days of data they each have in 2019.
combined_data <- combine_things(files_to_combine)
counts_2019_after <- combined_data %>%
  mutate(year = as.numeric(format(Date, "%Y"))) %>%
  filter(year == 2019) %>%
  group_by(site_no) %>%
  summarize(count = n())
# How many of the "missing sites" have enough data to be active before the combiner?
counts_2019_before %>%
  filter(site_no %in% missing_sites) %>%
  filter(count >= 335) %>%
  nrow()
# [1] 63
# How many of the "missing sites" have enough data to be active after the combiner?
counts_2019_after %>%
  filter(site_no %in% missing_sites) %>%
  filter(count >= 335) %>%
  nrow()
# [1] 0
More specifically, I did narrow it down to the convert_to_long function ... which I thought I had previously eliminated as a possibility, but apparently that check was sloppy.
combine_things_no_convert <- function(...) {
  rds_files <- c(...)
  df_list <- list()
  for (i in seq_len(length(rds_files))) {
    flow_dat <- readRDS(rds_files[i])
    # reduced_dat <- convert_to_long(flow_dat)
    df_list[[i]] <- flow_dat
  }
  nwis_df <- do.call("bind_rows", df_list)
  return(nwis_df)
}
combined_data <- combine_things_no_convert(files_to_combine)
counts_2019_after <- combined_data %>%
  mutate(year = as.numeric(format(Date, "%Y"))) %>%
  filter(year == 2019) %>%
  group_by(site_no) %>%
  summarize(count = n())
# How many of the "missing sites" have enough data to be active when the combiner skips convert_to_long?
counts_2019_after %>%
  filter(site_no %in% missing_sites) %>%
  filter(count >= 335) %>%
  nrow()
# [1] 63
The actual issue seems to be NA flow values when flow_cd = "P Ice". The NAs get filtered out during the combiner step. The weirder issue is that the flow values show up when the data is downloaded directly from the NWIS site pages but come back as NA in our dataRetrieval pulls.
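I haven't pasted the real convert_to_long here, but the behavior is consistent with a reshape that drops rows with no flow value; a minimal stand-in illustrating that failure mode (hypothetical, not the pipeline's actual function) would be:
# Hypothetical stand-in for convert_to_long, NOT the pipeline function:
# keep site/date/value and drop rows with no flow value. Any "P Ice" day that
# came back from dataRetrieval with an NA flow disappears here, which is what
# pushes these sites below the 335-day "active" threshold.
convert_to_long_sketch <- function(flow_dat) {
  flow_dat %>%
    select(site_no, date = Date, value = X_00060_00003, value_cd = X_00060_00003_cd) %>%
    filter(!is.na(value))
}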
Navigating to the site pages and downloading the data shows values even with P Ice (https://waterdata.usgs.gov/sd/nwis/dv?cb_00060=on&format=rdb&site_no=05050000&referred_module=sw&period=&begin_date=2019-01-01&end_date=2019-12-31).
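The same rdb endpoint can also be read programmatically with importRDB1 to compare against the web-service output (a sketch; the rdb column names differ somewhat from the readNWIS* results):
# Read the rdb file that the site page serves, bypassing the readNWISdv parsing
rdb_url <- paste0("https://waterdata.usgs.gov/sd/nwis/dv?cb_00060=on&format=rdb",
                  "&site_no=05050000&referred_module=sw",
                  "&begin_date=2019-01-01&end_date=2019-12-31")
web_rdb <- dataRetrieval::importRDB1(rdb_url, asDateTime = FALSE)
tail(web_rdb) # the ice-affected days at the end of 2019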
However, we get NAs when we pull the data down using readNWISdata or readNWISdv
dataRetrieval::readNWISdv("05050000", parameterCd = "00060",
startDate = "2019-11-21", endDate = "2019-12-31")
agency_cd site_no Date X_00060_00003 X_00060_00003_cd
1 USGS 05050000 2019-11-21 NA P Ice
2 USGS 05050000 2019-11-22 NA P Ice
3 USGS 05050000 2019-11-23 NA P Ice
4 USGS 05050000 2019-11-24 NA P Ice
5 USGS 05050000 2019-11-25 NA P Ice
6 USGS 05050000 2019-11-26 NA P Ice
7 USGS 05050000 2019-11-27 NA P Ice
8 USGS 05050000 2019-11-28 NA P Ice
9 USGS 05050000 2019-11-29 NA P Ice
10 USGS 05050000 2019-11-30 NA P Ice
11 USGS 05050000 2019-12-01 NA P Ice
12 USGS 05050000 2019-12-02 NA P Ice
13 USGS 05050000 2019-12-03 NA P Ice
14 USGS 05050000 2019-12-04 NA P Ice
15 USGS 05050000 2019-12-05 NA P Ice
16 USGS 05050000 2019-12-06 NA P Ice
17 USGS 05050000 2019-12-07 NA P Ice
18 USGS 05050000 2019-12-08 NA P Ice
19 USGS 05050000 2019-12-09 NA P Ice
20 USGS 05050000 2019-12-10 NA P Ice
21 USGS 05050000 2019-12-11 NA P Ice
22 USGS 05050000 2019-12-12 NA P Ice
23 USGS 05050000 2019-12-13 NA P Ice
24 USGS 05050000 2019-12-14 NA P Ice
25 USGS 05050000 2019-12-15 NA P Ice
26 USGS 05050000 2019-12-16 NA P Ice
27 USGS 05050000 2019-12-17 NA P Ice
28 USGS 05050000 2019-12-18 NA P Ice
29 USGS 05050000 2019-12-19 NA P Ice
30 USGS 05050000 2019-12-20 NA P Ice
31 USGS 05050000 2019-12-21 NA P Ice
32 USGS 05050000 2019-12-22 NA P Ice
33 USGS 05050000 2019-12-23 NA P Ice
34 USGS 05050000 2019-12-24 NA P Ice
35 USGS 05050000 2019-12-25 NA P Ice
36 USGS 05050000 2019-12-26 NA P Ice
37 USGS 05050000 2019-12-27 NA P Ice
38 USGS 05050000 2019-12-28 NA P Ice
39 USGS 05050000 2019-12-29 NA P Ice
40 USGS 05050000 2019-12-30 NA P Ice
41 USGS 05050000 2019-12-31 NA P Ice
dataRetrieval::readNWISdata(service = "dv", siteNumbers = "05050000", parameterCd = "00060",
startDate = "2019-11-21", endDate = "2019-12-31")
agency_cd site_no dateTime X_00060_00003 X_00060_00003_cd tz_cd
1 USGS 05050000 2019-11-21 NA P Ice UTC
2 USGS 05050000 2019-11-22 NA P Ice UTC
3 USGS 05050000 2019-11-23 NA P Ice UTC
4 USGS 05050000 2019-11-24 NA P Ice UTC
5 USGS 05050000 2019-11-25 NA P Ice UTC
6 USGS 05050000 2019-11-26 NA P Ice UTC
7 USGS 05050000 2019-11-27 NA P Ice UTC
8 USGS 05050000 2019-11-28 NA P Ice UTC
9 USGS 05050000 2019-11-29 NA P Ice UTC
10 USGS 05050000 2019-11-30 NA P Ice UTC
11 USGS 05050000 2019-12-01 NA P Ice UTC
12 USGS 05050000 2019-12-02 NA P Ice UTC
13 USGS 05050000 2019-12-03 NA P Ice UTC
14 USGS 05050000 2019-12-04 NA P Ice UTC
15 USGS 05050000 2019-12-05 NA P Ice UTC
16 USGS 05050000 2019-12-06 NA P Ice UTC
17 USGS 05050000 2019-12-07 NA P Ice UTC
18 USGS 05050000 2019-12-08 NA P Ice UTC
19 USGS 05050000 2019-12-09 NA P Ice UTC
20 USGS 05050000 2019-12-10 NA P Ice UTC
21 USGS 05050000 2019-12-11 NA P Ice UTC
22 USGS 05050000 2019-12-12 NA P Ice UTC
23 USGS 05050000 2019-12-13 NA P Ice UTC
24 USGS 05050000 2019-12-14 NA P Ice UTC
25 USGS 05050000 2019-12-15 NA P Ice UTC
26 USGS 05050000 2019-12-16 NA P Ice UTC
27 USGS 05050000 2019-12-17 NA P Ice UTC
28 USGS 05050000 2019-12-18 NA P Ice UTC
29 USGS 05050000 2019-12-19 NA P Ice UTC
30 USGS 05050000 2019-12-20 NA P Ice UTC
31 USGS 05050000 2019-12-21 NA P Ice UTC
32 USGS 05050000 2019-12-22 NA P Ice UTC
33 USGS 05050000 2019-12-23 NA P Ice UTC
34 USGS 05050000 2019-12-24 NA P Ice UTC
35 USGS 05050000 2019-12-25 NA P Ice UTC
36 USGS 05050000 2019-12-26 NA P Ice UTC
37 USGS 05050000 2019-12-27 NA P Ice UTC
38 USGS 05050000 2019-12-28 NA P Ice UTC
39 USGS 05050000 2019-12-29 NA P Ice UTC
40 USGS 05050000 2019-12-30 NA P Ice UTC
41 USGS 05050000 2019-12-31 NA P Ice UTC
Need to figure it out, but we noticed this in the gages through the ages viz. The count of gages in 2019 on gages through the ages matches what we have coming out of this pipeline, so the issue must be somewhere in here rather than in the viz. Here is a bit of what I did to evaluate this (haven't truly found the source of the error yet).
(Figure: map of 2019 "active" gages)