Here is where the new NC files are for this data. The variables available here are already formatted in the units we would use for GLM, which makes the recent unit conversion effort a bit moot. See
Don't use the snow variable; we should still calculate snow ourselves because that variable looks like snow cover. Will need to convert precip from mm/day to m/day. Also, this might be daily data (!!!), which will hopefully translate into performance improvements for #258.
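For reference, that precip conversion is just a factor of 1000 in the munge step; a minimal sketch (object and column names here are illustrative, not the pipeline's):
# Convert precip from mm/day (as served) to m/day (what GLM expects)
drivers <- dplyr::mutate(drivers, prcp_m_day = prcp_debias / 1000)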
The grid seems to be slightly different from the other one because my query cell is not returning any data. I can get data if I ask for just WI, so we might just need to make slight adjustments. David suggested looking at https://github.com/USGS-R/necsc-lake-modeling for help to get the new grid right.
library(geoknife)
gcm_job <- geoknife(
stencil = webgeom('state::Wisconsin'),
fabric = webdata(
url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_mri",
variables = c(
'windspeed_debias',
'tas_debias', # air temp
'rsds_debias', #shortwave radiation
'rsdl_debias', # longwave radiation
'rh_debias', # relative humidity
'prcp_debias'), # precip
times = c('2040-01-01', '2040-01-10')
)
# The default knife algorithm is appropriate
)
wait(gcm_job)
gcm_job_status <- check(gcm_job)$statusType
my_data <- result(gcm_job, with.units = TRUE)
This one and #258 are kind of blurring now because these new debiased GCMs are daily data, which makes this a much faster download process. I was having some trouble getting our reconstructed grid to work just yet, so I did some testing with just a grid over WI.
First, I wanted to test the download times for different GCMs and different numbers of cells. This first test only varied the GCM and the number of cells per query (not the time periods), BUT I would say even multiplying this by 3 to approximate all 3 projection periods puts us way ahead of the 16 hours from my test with 5 lakes (4 different queries per GCM + projection period). Also, there doesn't seem to be much difference based on the number of cells per query.
# This function saves two files - the actual data file (I wanted the run times to reflect this step) and
# the run results with three columns `gcm`, `ncells`, and `dur_sec`. I save this out just in case one
# fails so I can still look at the results and not have a bunch of time wasted.
run_test <- function(geom_sf, ncells, gcm, projection_period) {
set.seed(18)
query_cells_sf <- geom_sf[sample(1:nrow(geom_sf), ncells),]
sf_pts_to_simplegeom <- function(sf_obj) {
sf_obj %>%
st_coordinates() %>%
t() %>% as.data.frame() %>%
# Include the cell numbers in the GDP geom so that the
# results can be linked to the correct cells
setNames(sf_obj$cell_no) %>%
simplegeom()
}
query_geom_WGS84 <- sf::st_transform(query_cells_sf, crs = 4326)
query_simplegeom <- sf_pts_to_simplegeom(query_geom_WGS84)
query_dates <- switch(
as.character(projection_period),
'1980_1999' = c('1980-01-01', '1999-12-31'),
'1980_2099' = c('1980-01-01', '2099-12-31')
)
fp <- tryCatch({
query_stats <- system.time({
gcm_job <- geoknife(
stencil = query_simplegeom,
fabric = webdata(
url = sprintf("http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_%s", gcm),
variables = c(
'windspeed_debias',
'tas_debias', # air temp
'rsds_debias', #shortwave radiation
'rsdl_debias', # longwave radiation
'rh_debias', # relative humidity
'prcp_debias'), # precip
times = query_dates
)
# The default knife algorithm is appropriate
)
wait(gcm_job)
gcm_job_status <- check(gcm_job)$statusType
my_data <- result(gcm_job, with.units = TRUE)
out_file <- sprintf("7_drivers_munge/test/new_%s_%s_%s.feather", gcm, projection_period, ncells)
arrow::write_feather(my_data, out_file)
})
dur_sec <- query_stats[["elapsed"]]
fp <- sprintf('7_drivers_munge/test/new_run_results_%s_%s_%s.feather', gcm, projection_period, ncells)
arrow::write_feather(tibble(gcm, ncells, dur_sec), fp)
return(fp)
}, error = function(e) return(""))
return(fp)
}
library(tidyverse)
library(sf)
library(geoknife)
# Test grid over all of WI
wi_sf <- maps::map('state', 'wi', plot=FALSE, fill=TRUE) %>% st_as_sfc()
wi_grid_sfc <- st_make_grid(wi_sf)
wi_grid_sf <- st_sf(data.frame(cell_no = 1:length(wi_grid_sfc)), geometry=wi_grid_sfc)
wi_grid_centroids <- st_centroid(wi_grid_sf)
# Set up the different tests
test_criteria <- expand.grid(
ncells = c(1,5,10,20),
gcm = c('access', 'cnrm', 'gfdl', 'ipsl', 'mri', 'miroc5'),
projection_period = c('1980_1999'))
# Run all the tests (saves those two files per run)
pmap_chr(test_criteria, run_test, geom_sf = wi_grid_centroids)
# There were a few issues with errors near the end that had me re-run just
# a couple of the tests by subsetting `test_criteria` in the above command
# and re-running.
# Save out a single file summarizing all the run results
purrr::map(list.files('7_drivers_munge/test', 'new_run_results_', full.names = T),
arrow::read_feather) %>%
bind_rows() %>%
write_csv("7_drivers_munge/test/new_run_results.csv")
Then, we can look at the results (using the filename pattern)
test_results1 <- read_csv("7_drivers_munge/test/new_run_results.csv")
ggplot(test_results1, aes(ncells, dur_sec)) +
geom_point(size = 2) +
facet_grid(gcm~.)
# Total of ~ 75 min
(test_results1 %>% pull(dur_sec) %>% sum())/60
# Capturing the 24 rows of results here for reproducibility
as.data.frame(test_results1)
gcm ncells dur_sec
1 access 1 39.46
2 access 10 91.69
3 access 20 241.49
4 access 5 68.68
5 cnrm 1 161.36
6 cnrm 10 174.69
7 cnrm 20 280.71
8 cnrm 5 210.16
9 gfdl 1 167.92
10 gfdl 10 167.36
11 gfdl 20 278.96
12 gfdl 5 217.14
13 ipsl 1 169.46
14 ipsl 10 165.11
15 ipsl 20 247.40
16 ipsl 5 213.89
17 miroc5 1 172.96
18 miroc5 10 180.81
19 miroc5 20 273.67
20 miroc5 5 80.12
21 mri 1 167.31
22 mri 10 179.22
23 mri 20 250.75
24 mri 5 214.74
I want to compare how long it would take to pull 1980-2099 all at once as opposed to pulling each period separately.
# Use the same spatial setup as before
library(tidyverse)
library(sf)
library(geoknife)
# Test grid over all of WI
wi_sf <- maps::map('state', 'wi', plot=FALSE, fill=TRUE) %>% st_as_sfc()
wi_grid_sfc <- st_make_grid(wi_sf)
wi_grid_sf <- st_sf(data.frame(cell_no = 1:length(wi_grid_sfc)), geometry=wi_grid_sfc)
wi_grid_centroids <- st_centroid(wi_grid_sf)
# Set up the different tests
test_criteria2 <- expand.grid(
ncells = c(1,5,10,20),
gcm = c('cnrm'),
projection_period = c('1980_2099'))
# Run the tests
test_results2 <- pmap_chr(test_criteria2, run_test, geom_sf = wi_grid_centroids)
# Write out the results
purrr::map(list.files('7_drivers_munge/test', 'new_run_results_.*1980_2099', full.names = T),
arrow::read_feather) %>%
bind_rows() %>%
write_csv("7_drivers_munge/test/new_run_results_full_period.csv")
Now, check out the results and compare to the 1980-1999 times (TBD ... waiting for build but don't want to lose the stuff above)
# File sizes are definitely getting big with the full time period:
filemetadata <- as.data.frame(file.info(list.files('7_drivers_munge/test', 'new_cnrm_1980_2099', full.names=T)))
filemetadata <- filemetadata %>%
mutate(path = rownames(filemetadata)) %>%
mutate(size_mb = round(size / 10^6, 2)) %>%
select(path, size_mb) %>% arrange(size_mb)
rownames(filemetadata) <- NULL
filemetadata
path size_mb
1 7_drivers_munge/test/new_cnrm_1980_2099_1.feather 2.57
2 7_drivers_munge/test/new_cnrm_1980_2099_5.feather 5.44
3 7_drivers_munge/test/new_cnrm_1980_2099_10.feather 8.33
4 7_drivers_munge/test/new_cnrm_1980_2099_20.feather 14.05
# Load the result files and combine with the previous results (but only keep the GCM that matches)
test_results2 <- read_csv("7_drivers_munge/test/new_run_results_full_period.csv")
test_results_together <- test_results2 %>%
mutate(query_dates = '1980_2099') %>%
bind_rows(mutate(test_results1, query_dates = '1980_1999')) %>%
filter(gcm == "cnrm")
ggplot(test_results_together, aes(ncells, dur_sec, color = query_dates)) +
geom_point(size = 2) +
facet_grid(gcm~.)
test_results_compare <- test_results_together %>%
pivot_wider(names_from = query_dates, values_from = dur_sec) %>%
arrange(ncells) %>%
mutate(relationship = round(`1980_2099` / `1980_1999`, 2))
test_results_compare
# A tibble: 4 x 5
gcm ncells `1980_2099` `1980_1999` relationship
<chr> <dbl> <dbl> <dbl> <dbl>
1 cnrm 1 133. 161. 0.83
2 cnrm 5 571. 210. 2.72
3 cnrm 10 560. 175. 3.21
4 cnrm 20 543. 281. 1.93
# It took 30 min to run 4 queries on the full time period.
round(sum(test_results_compare$`1980_2099`)/60)
# It took 14 min to run 4 queries on the first projection period alone
round(sum(test_results_compare$`1980_1999`)/60)
Running the full time period (1980-2099) takes about double the time of running just one of the three time periods alone. This is great because it means that running the full time period at once would take ~2/3 of the time of running each period individually. The downside is that the file sizes are much, much larger (the feather file for 20 cells with only the first period is 4.3 MB, but the feather file for 20 cells with the full period is 13.4 MB). As we scale to more lakes/cells, this could be a problem.
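To get a rough sense of how the file sizes scale, here is a quick extrapolation from the sizes measured above (a sketch only; the linear fit and the larger cell counts are assumptions):
# Rough linear extrapolation of full-period feather size vs number of cells per query
sizes <- data.frame(ncells = c(1, 5, 10, 20), size_mb = c(2.57, 5.44, 8.33, 14.05))
fit <- lm(size_mb ~ ncells, data = sizes)
predict(fit, newdata = data.frame(ncells = c(100, 363))) # e.g., hypothetical 100- and 363-cell queries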
Some weirdness with the dates:
library(geoknife)
gcm_job <- geoknife(
stencil = webgeom('state::Wisconsin'),
fabric = webdata(
url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_mri",
variables = c(
'windspeed_debias',
'tas_debias', # air temp
'rsds_debias', #shortwave radiation
'rsdl_debias', # longwave radiation
'rh_debias', # relative humidity
'prcp_debias'), # precip
times = c('1980-01-01', '2099-12-31')
)
# The default knife algorithm is appropriate
)
wait(gcm_job)
gcm_job_status <- check(gcm_job)$statusType
my_data <- result(gcm_job, with.units = TRUE)
summary(my_data$DateTime)
Min. 1st Qu. Median
"1981-01-01 00:00:00" "1995-10-02 07:44:17" "2049-07-01 19:13:16"
Mean 3rd Qu. Max.
"2043-03-02 19:13:16" "2084-04-01 06:42:14" "2098-12-31 14:26:33"
The min is weirdly Jan 1 1981 and the max is Dec 31 2098. I was expecting 1980 as the min and 2099 as the max.
I think if you leave `times` as `NA` you would get the whole time range of the dataset, but I'm also surprised by your result here.
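For what it's worth, a minimal sketch of that suggestion (an assumption on my part that the default, unset `times` requests the full record; untested here):
library(geoknife)
# Omit `times` so the query should span the dataset's full time range
fabric <- webdata(url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_mri", variables = c('tas_debias'))
times(fabric) # Expect c(NA, NA), i.e., no time subsetting in the request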
Figuring out the new grid params:
nc <- ncdf4::nc_open('http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias/files/debias_windspeed_mri_mid21_longerdaymet_ghcn_365d.nc')
unique(diff(ncdf4::ncvar_get(nc, 'jx'))) # For cellsize
unique(diff(ncdf4::ncvar_get(nc, 'iy'))) # Confirming cellsize
min(ncdf4::ncvar_get(nc, 'jx')) # For xmin
min(ncdf4::ncvar_get(nc, 'iy')) # For ymin
length(ncdf4::ncvar_get(nc, 'jx')) # For nx
length(ncdf4::ncvar_get(nc, 'iy')) # For ny
Using those results, I will be editing the grid params to this tibble:
tibble(
crs = "+proj=lcc +lat_0=45 + lon_0=-97 +lat_1=36 +lat_2=52 +x_0=0 +y_0=0 +ellps=WGS84 +units=m",
cellsize = 25000,
xmin = -2700000,
ymin = -1750000,
nx = 217,
ny = 141
)
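As a sanity check, those parameters can be turned into a polygon grid locally; a minimal sketch (not the pipeline's actual grid-building code):
library(sf)
lcc_proj <- "+proj=lcc +lat_0=45 +lon_0=-97 +lat_1=36 +lat_2=52 +x_0=0 +y_0=0 +ellps=WGS84 +units=m"
cellsize <- 25000
nx <- 217
ny <- 141
xmin <- -2700000
ymin <- -1750000
grid_bbox <- st_bbox(c(xmin = xmin, ymin = ymin, xmax = xmin + nx * cellsize, ymax = ymin + ny * cellsize), crs = st_crs(lcc_proj))
grid_sfc <- st_make_grid(st_as_sfc(grid_bbox), cellsize = cellsize)
grid_sf <- st_sf(data.frame(cell_no = seq_along(grid_sfc)), geometry = grid_sfc)
length(grid_sfc) == nx * ny # Should be TRUE (217 * 141 = 30,597 cells)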
I think the `DateTime` column is the first day of the year and then you have to add the `time(day of year)` column to it to get the real dates ...
head(raw_data)
DateTime 16188 16402 variable statistic time(day of year)
1 1981-01-01 0.00000000 0.0000000 prcp_debias MEAN 0
2 1981-01-01 0.00000000 0.0230931 prcp_debias MEAN 1
3 1981-01-01 3.36814360 3.8792896 prcp_debias MEAN 2
4 1981-01-01 1.70109210 2.6726415 prcp_debias MEAN 3
5 1981-01-01 0.01876353 0.5939542 prcp_debias MEAN 4
6 1981-01-01 1.41612720 0.7077347 prcp_debias MEAN 5
> length(which(as.Date(raw_data$DateTime) == "1981-01-01"))
[1] 2190
> 2190/6 # How many Jan 1 1981s per GCM?
[1] 365
Yeah ... that is definitely the case:
raw_data %>%
mutate(date = as.Date(DateTime) + `time(day of year)`) %>%
select(DateTime, `time(day of year)`, date) %>%
summary()
DateTime time(day of year) date
Min. :1981-01-01 00:00:00 Min. : 0 Min. :1981-01-01
1st Qu.:1995-10-02 07:44:17 1st Qu.: 91 1st Qu.:1995-12-31
Median :2049-07-01 19:13:16 Median :182 Median :2049-12-30
Mean :2043-03-02 19:13:16 Mean :182 Mean :2043-08-31
3rd Qu.:2084-04-01 06:42:14 3rd Qu.:273 3rd Qu.:2084-12-29
Max. :2098-12-31 14:26:33 Max. :364 Max. :2099-12-30
Not sure why we aren't getting 2099-12-31 as the max date though? Is that a timezone thing?
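One quick way to test the timezone hypothesis (just a sketch, assuming `raw_data` from above is still in memory):
library(dplyr)
raw_data %>%
mutate(date_local = as.Date(DateTime) + `time(day of year)`, date_utc = as.Date(DateTime, tz = "UTC") + `time(day of year)`) %>%
summarize(max_local = max(date_local), max_utc = max(date_utc))
# If max_utc comes back as 2099-12-31, the missing day is just as.Date() truncating in the local timezone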
I was just able to run the pipeline accessing the new GCMs. For two tiles x 6 GCMs x all time at once, there were 12 total queries and that took 60 min. The longest query took 7 min, which is a far cry from my test where the longest individual query took 40 min (not to mention that the number of queries is cut to a third thanks to doing all time periods at once).
# Used the function from this comment to run these
# https://github.com/USGS-R/lake-temperature-model-prep/issues/258#issuecomment-999763969
> # How long did the GDP download take and how big are the resulting files?
> tar_load(gcm_data_raw_feather)
> target_metadata_download <- summarize_target_build(names(gcm_data_raw_feather))
-- DURATION --
Shortest: 3.106 min
Longest: 7.651 min
Total: 60.677 min
-- FILE SIZES --
Smallest: 3.515 mb
Biggest: 3.562 mb
Total: 42.517 mb
>
> # How long did the munge take and how big are the resulting files?
> tar_load(glm_ready_gcm_data_feather)
> target_metadata_munge <- summarize_target_build(names(glm_ready_gcm_data_feather))
-- DURATION --
Shortest: 0.012 min
Longest: 0.132 min
Total: 0.288 min
-- FILE SIZES --
Smallest: 1.992 mb
Biggest: 2.038 mb
Total: 24.23 mb
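`summarize_target_build()` comes from the linked comment; for context, a minimal sketch of how such a summary could be computed from `targets` metadata (not the actual helper; `seconds` and `bytes` are the columns `tar_meta()` provides):
library(targets)
library(dplyr)
summarize_target_build_sketch <- function(target_names) {
tar_meta(names = any_of(target_names)) %>%
summarize(
shortest_min = min(seconds) / 60,
longest_min = max(seconds) / 60,
total_min = sum(seconds) / 60,
smallest_mb = min(bytes) / 1e6,
biggest_mb = max(bytes) / 1e6,
total_mb = sum(bytes) / 1e6
)
}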
Running this using a rudimentary match to MN just to do a slight scaling up. Will see what results are on Monday! 9 tiles with 363 cells total. The total number of lakes is 14,721.
tar_load(query_cells_centroids_list_by_tile)
lapply(query_cells_centroids_list_by_tile, nrow) %>% unlist() %>% unname()
69 24 3 12 100 26 17 69 43
lake_cell_xwalk_df %>% group_by(cell_no) %>% summarize(n = n()) %>% arrange(desc(n))
# A tibble: 363 x 2
cell_no n
<int> <int>
1 17696 238
2 16387 227
3 16610 202
4 18137 189
5 16604 187
6 17042 181
7 16821 176
8 15954 172
9 17255 163
10 18138 161
# ... with 353 more rows
Using the more sophisticated method from #278, there are 14,678 MN lakes (which is actually slightly fewer than my previous approach). Note that the `query_cells_centroids_list_by_tile` is the same, so it doesn't change the cells we are querying.
Currently running the downloads with the GDP fix noted here. For MN, we have 9 different tiles with varying numbers of cells in each. Each tile = the spatial limits for one GDP query. With 9 tiles and 6 different GCMs (pulling all dates at once), we end up with 54 total queries to GDP.
# Merge lake counts per cell with the tile and cell centroid spatial info
library(targets)
library(sf)
library(tidyverse)
tar_load(query_cells_info_df)
tar_load(query_cells_centroids_list_by_tile)
tile_list_w_counts <- purrr::map(query_cells_centroids_list_by_tile, function(x) {
left_join(x, dplyr::select(query_cells_info_df, -x, -y), by = "cell_no")
}) %>% bind_rows()
# Make some maps
# 9 different tiles * 6 GCMs = 54 total queries for all of MN
tile_list_w_counts %>%
dplyr::mutate(tile_no = as.character(tile_no)) %>%
ggplot(aes(color = tile_no)) + geom_sf(size = 2) + theme_void() +
ggtitle("Cell centroids by tile (tile = one GDP query)")
tile_list_w_counts %>%
ggplot(aes(color = n_lakes)) + geom_sf(size = 2) +
scico::scale_color_scico(palette = 'batlow', direction = -1) +
theme_void()+
ggtitle("Cell centroids by number of lakes per cell")
I think my comment is directed at the future multi-state pull, since it is probably unnecessary optimization if aimed at MN (unless you want to look at the time costs of the queries).
Given this
9 tiles with 363 cells total
I'd think all of those cells could be snagged in a single bigger tile (or two, since you don't really want your cell layout to be optimized only for the state of MN...), reducing your number of GDP queries by 9x (see the sketch below). Unless you get into the territory of 1000 cells in a query, you should still be able to do these in one call. Unless the pivot table operations are the limiting factor for the number of cells.
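A sketch of what that single-query idea could look like, reusing the pipeline targets and the point-to-simplegeom pattern shown earlier (an illustration only, and it assumes the combined stencil stays under whatever cell-count limit GDP can comfortably handle):
library(targets)
library(dplyr)
library(sf)
library(geoknife)
tar_load(query_cells_centroids_list_by_tile)
# Collapse the 9 per-tile centroid sets into one stencil (363 cells -> 1 GDP query per GCM)
all_cells_sf <- do.call(rbind, query_cells_centroids_list_by_tile)
all_cells_simplegeom <- all_cells_sf %>%
st_coordinates() %>%
t() %>% as.data.frame() %>%
setNames(all_cells_sf$cell_no) %>%
simplegeom()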
Just a note - we're moving ahead on the netCDF PR #281 for now, so that I can work on incorporating those netCDFs into the lake-temperature-process-models pipeline, but we will potentially need to revise the netCDF code depending on the resolution of this issue (e.g., if the grid mapping has in fact changed, or if those cells shouldn't have NA values).
Verifying the projection. Michael Notaro sent us this `rcm_map` variable, which is the same as the one from the un-debiased data, which would mean that we are already using the appropriate projection. Validating further just to make sure by 1) plotting air temp for one lake in our current dataset against that of the same lake in the Winslow 2017 data release, and 2) using the un-projected coordinates to make a map of the grid bounding box.
`rcm_map` variable:
char rcm_map ;
  rcm_map:grid_mapping_name = "lambert_conformal_conic" ;
  rcm_map:standard_parallel = 36., 52. ;
  rcm_map:longitude_of_central_meridian = -97. ;
  rcm_map:latitude_of_projection_origin = 45. ;
  rcm_map:_CoordinateTransformType = "Projection" ;
  rcm_map:_CoordinateAxisTypes = "GeoX GeoY" ;
Quantiles of difference between AirTemp (negative = Winslow was higher, positive = recent pull was higher). About 80% of the data is within +/- 3 deg
10% 50% 90%
-2.956229 0.199418 3.180503
blue = 1:1 line, red = 10th and 90th percentiles
blue = 0 (no difference), red = 10th and 90th percentiles
Using the lat/long:
Using the projected coordinates:
Code for first two plots (you need to have already built the targets pipeline locally)
library(targets)
library(tidyverse)
library(sbtools)
local_dir <- "7_drivers_munge/test"
if(!dir.exists(local_dir)) dir.create(local_dir)
gcm_name <- "GFDL"
# Downloaded and unzipped gcm from this ScienceBase item locally:
# https://www.sciencebase.gov/catalog/item/57dc37b0e4b090824ffd5663
gcm_local <- file.path(local_dir, gcm_name)
gcm_local_files <- list.files(gsub(".zip", "", gcm_local), full.names = TRUE)
# Define the lake
winslow_lake_to_eval <- "nhd_1099030"
# Got this xwalk file from Hayley
winslow_to_nhdhr_xwalk <- read_csv(file.path(local_dir, "winslow_nhdhr_xwalk.csv"))
# Figure out which nhdhr lake id is equivalent to the Winslow one and then identify the cell
nhdhr_id <- winslow_to_nhdhr_xwalk %>% filter(WINSLOW_ID == winslow_lake_to_eval) %>% pull(site_id)
targets::tar_load(lake_cell_xwalk_df)
cell_of_lake_to_eval <- lake_cell_xwalk_df %>% filter(site_id == nhdhr_id) %>% pull(cell_no)
old_gcm_airtemp <- read_csv(gcm_local_files[grepl(winslow_lake_to_eval, gcm_local_files)]) %>%
select(time, AirTemp) %>% rename(AirTemp_Winslow = AirTemp)
# Use the feather files that we created per cell & combine all three timeperiods
cell_feathers <- targets::tar_read(out_skipnc_feather)
cell_feathers_gcm <- cell_feathers[grepl(gcm_name, cell_feathers)]
cell_feathers_gcm_cell <- cell_feathers_gcm[grepl(sprintf("%s.feather", cell_of_lake_to_eval), cell_feathers_gcm)]
new_gcm_airtemp <- purrr::map(cell_feathers_gcm_cell, arrow::read_feather) %>%
bind_rows() %>% select(time, AirTemp) %>% rename(AirTemp_Recent = AirTemp)
all_gcm_airtemp <- full_join(old_gcm_airtemp, new_gcm_airtemp)
# Some weirdness with mismatched leap years in the new data still
all_gcm_airtemp %>% filter(is.na(AirTemp_Winslow) | is.na(AirTemp_Recent)) %>% View()
ggplot(all_gcm_airtemp, aes(AirTemp_Winslow, AirTemp_Recent)) +
geom_point(shape = 1) +
geom_abline(slope=1, intercept = 0, color = "blue", size = 2) +
geom_quantile(color = "red", quantiles = c(0.10, 0.90)) +
ggtitle("AirTemp", subtitle = sprintf("Lake %s (Winslow: %s)", nhdhr_id, winslow_lake_to_eval))
all_gcm_airtemp_diff <- all_gcm_airtemp %>%
mutate(AirTemp_diff = AirTemp_Recent - AirTemp_Winslow)
# Look at quantile of differences. ~ 80% of data is within +/- 3 deg
quantile(all_gcm_airtemp_diff$AirTemp_diff, probs = c(0.10, 0.50, 0.90), na.rm = TRUE)
# 10% 50% 90%
# -2.956229 0.199418 3.180503
ggplot(all_gcm_airtemp_diff, aes(time, AirTemp_diff)) +
geom_point(shape = 1) +
geom_hline(yintercept = 0, color = "blue", size = 2) +
geom_quantile(color = "red", quantiles = c(0.10, 0.90)) +
ggtitle("AirTemp, difference between Winslow and recent", subtitle = sprintf("Lake %s (Winslow: %s)", nhdhr_id, winslow_lake_to_eval))
Code for the map (need to be on VPN for `nc_open()` to work)
library(sf)
library(ncdf4)
nc_proj_str <- "+proj=lcc +lat_0=45 + lon_0=-97 +lat_1=36 +lat_2=52 +x_0=0 +y_0=0 +ellps=WGS84 +units=m"
conus_sf <- st_as_sf(maps::map('state', plot=FALSE, fill=TRUE))
nc <- nc_open('http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias/files/debias_windspeed_mri_mid21_longerdaymet_ghcn_365d.nc')
xmin <- min(ncvar_get(nc, 'xlon')) # For xmin
xmax <- max(ncvar_get(nc, 'xlon')) # For xmax
ymin <- min(ncvar_get(nc, 'xlat')) # For ymin
ymax <- max(ncvar_get(nc, 'xlat')) # For ymax
jxmin <- min(ncvar_get(nc, 'jx')) # For projected xmin
jxmax <- max(ncvar_get(nc, 'jx')) # For projected xmax
iymin <- min(ncvar_get(nc, 'iy')) # For projected ymin
iymax <- max(ncvar_get(nc, 'iy')) # For projected ymax
nc_close(nc)
latlon_extent <- expand.grid(x = c(xmin, xmax), y = c(ymin, ymax))
projected_extent <- expand.grid(x = c(jxmin, jxmax), y = c(iymin, iymax))
nc_bounds_sf <- st_as_sf(latlon_extent, coords = c("x", "y"), crs = 4326)
nc_bounds_proj_sf <- st_as_sf(projected_extent, coords = c("x", "y"), crs = nc_proj_str)
nc_bbox_sf <- st_as_sfc(st_bbox(nc_bounds_sf))
nc_bbox_proj <- st_as_sfc(st_bbox(nc_bounds_proj_sf))
plot(st_geometry(nc_bounds_sf), pch=20, col="red")
plot(st_geometry(nc_bbox_sf), add=TRUE)
plot(st_geometry(conus_sf), add=TRUE)
plot(st_geometry(nc_bounds_proj_sf), pch=20, col="red")
plot(st_geometry(nc_bbox_proj), add=TRUE)
plot(st_geometry(st_transform(conus_sf, nc_proj_str)), add=TRUE)
Started looking into the 4 missing cells because I was wondering whether or not they had anything to do with the weird projection stuff. I don't think so, but here is something weird. If I look at just one of the cells, 17909, and query the cell centroid (shown by the blue dot below), I get NaN values. If instead I query the centroid of the lakes (the open black circles) within this cell (shown by the red dot below), I don't get any NaNs. So, I'm not quite sure what is happening with GDP here. Perhaps the grids are actually smaller than we thought they were? I would expect both of these to return the same data if they are within the same cell. @jread-usgs any ideas?
Everything I dig into just opens more questions!!
library(sf)
library(geoknife)
query_gdp <- function(query_simplegeom) {
gcm_job <- geoknife(
stencil = query_simplegeom,
fabric = webdata(
url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_gfdl",
variables = c("prcp_debias", "tas_debias", "rh_debias", "rsds_debias", "rsdl_debias", "windspeed_debias"),
times = c("1981-01-01 00:00:00", "1981-02-01 00:00:00")
)
# The default knife algorithm is appropriate
)
wait(gcm_job)
data_out <- result(gcm_job, with.units = TRUE)
return(data_out)
}
red_data <- query_gdp(simplegeom(c(-94.87765, 47.79147)))
blue_data <- query_gdp(simplegeom(c(-94.81306, 47.81679)))
wow, that is a surprising outcome...hmmm
We're using the feature-weighted summary algorithm of the GDP (that is the default, so we don't specify it). I know you are aware of that, but stating it here to summarize the issue and my understanding of it. That algorithm will weight the areas of sections of the polygon that fall into different cells of the gridded data according to their percentage of area in the cells. We're using that algorithm, but with a "point" we're creating a tiny little diamond polygon to use as the feature since GDP doesn't allow us to just use a true point (this happens behind the scenes in `geoknife`, to buffer that point). Whenever the feature polygon is completely within the grid cell, we should get an identical result from GDP regardless of where in that grid cell the feature polygon is. So your red point and the blue point should return the same thing.
So...yes, I'm really surprised by this result.
A couple of things that could cause this:
1) Is there a chance your blue dot is actually on the edge of a cell instead of the true centroid? I think this could be the case if the grid construction for the queries didn't match the underlying data grid, and was offset by 1/2 a cell in any or both directions. Then the blue "centroid" would be a diamond that gets a 1/2 or 1/4 of the weighted data from adjacent cells, and the red polygon would fall fully into one of those underlying data cells. Your email to Notaro should help this question too.
2) Now I'm thinking about the grid and spatial info...if we didn't know or have confidence what the spatial data were for this dataset when it came in (I think it was missing some spatial attributes we have for other hosted datasets), how does the GDP know what underlying data grid exists and how to mesh up these two sources of data (the points and the underlying grid) together to get the right result?? Any info GDP has for this assumption/calculation would be info we'd have access to as well, but are there different things going on?
for 1), I'm not seeing much evidence that the blue centroid is close to an edge:
library(sf)
library(geoknife)
query_gdp <- function(query_simplegeom) {
gcm_job <- geoknife(
stencil = query_simplegeom,
fabric = webdata(
url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_gfdl",
variables = c("prcp_debias"),
times = c("1981-01-01 00:00:00", "1981-01-02 00:00:00")
)
# The default knife algorithm is appropriate
)
wait(gcm_job)
data_out <- result(gcm_job, with.units = TRUE)
return(data_out)
}
query_gdp(simplegeom(data.frame(centroid = c(-94.81306, 47.81679), left = c(-94.8131, 47.81679), right = c(-94.81294, 47.81679), below = c(-94.81294, 47.81675), above = c(-94.81294, 47.81682))))
DateTime above below centroid left right variable statistic time(day of year) units
1 1981-01-01 NaN NaN NaN NaN NaN prcp_debias MEAN 0 mm/day
2 1981-01-01 NaN NaN NaN NaN NaN prcp_debias MEAN 1 mm/day
3 1981-01-01 NaN NaN NaN NaN NaN prcp_debias MEAN 2 mm/day
So...What about setting up a line of points that inches between the blue and red dot. Is there a point where we get data?
n_guesses = 14
matrix(c(seq(-94.87765, to= -94.81306, length.out = n_guesses), seq(47.79147, to= 47.81679, length.out = n_guesses)), nrow = 2, byrow = TRUE) %>% as.data.frame() %>%
simplegeom() %>% query_gdp()
DateTime V1 V10 V11 V12 V13 V14 V2 V3 V4 V5 V6 V7 V8 V9 variable statistic time(day of year) units
1 1981-01-01 0.006079078 0.006079078 0.006079078 0.006079078 0.006079078 NaN 0.006079078 0.006079078 0.006079078 0.006079078 0.006079078 0.006079078 0.006079078 0.006079078 prcp_debias MEAN 0 mm/day
2 1981-01-01 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 NaN 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 prcp_debias MEAN 1 mm/day
3 1981-01-01 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 NaN 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 prcp_debias MEAN 2 mm/day
4 1981-01-01 6.515005000 6.515005000 6.515005000 6.515005000 6.515005000 NaN 6.515005000 6.515005000 6.515005000 6.515005000 6.515005000 6.515005000 6.515005000 6.515005000 prcp_debias MEAN 3 mm/day
huh, only the blue dot (`V14`) is NaN in that case
So perhaps above when I was going left/right/up/down, I should've checked diagonally up/left. I'm thinking the blue dot may actually be at a corner after all??
Update:
I'm not sure that my left, right, up, down nudges were actually wide enough to be outside of the `geoknife` buffer of the point. Redo:
offset <- 0.01
query_gdp(simplegeom(data.frame(centroid = c(-94.81306, 47.81679), down = c(-94.81306, 47.81679-offset), up = c(-94.81306, 47.81679+offset), left = c(-94.81306-offset, 47.81679), right = c(-94.81306+offset, 47.81679),up_left = c(-94.81306-offset, 47.81679+offset), up_right = c(-94.81306+offset, 47.81679+offset), dwn_left = c(-94.81306-offset, 47.81679-offset), dwn_right = c(-94.81306+offset, 47.81679-offset))))
Process Accepted
DateTime centroid down dwn_left dwn_right left right up up_left up_right variable statistic time(day of year) units
4 1981-01-01 NaN 6.515005000 6.515005000 7.633144000 NaN 7.18019700 NaN NaN 7.18019700 prcp_debias MEAN 3 mm/day
I'm seeing 4 different values here, which I think means we're getting data from either 4 different underlying cells, or the buffered points are blending data slightly differently from ~3 different cells
A sketch of what this return of data looks like
- is there a chance your blue dot is actually on the edge of a cell instead of the true centroid? I think this could be the case if the grid construction for the queries didn't match the underlying data grid, and was offset by 1/2 a cell in any or both directions. Then the blue "centroid" would be a diamond that gets a 1/2 or 1/4 of the weighted data from adjacent cells, and the red polygon would fall fully into one of those underlying data cells. Your email to Notaro should help this question too
- Now I'm thinking about the grid and spatial info...if we didn't know or have confidence what the spatial data were for this dataset when it came in (I think it was missing some spatial attributes we have for other hosted datasets), how does the GDP know what underlying data grid exists and how to mesh up these two sources of data (the points and the underlying grid) together to get the right result?? Any info GDP has for this assumption/calculation would be info we'd have access to as well, but are there different things going on?
Lindsay and I landed on these same questions yesterday when we were chatting. I think it really comes down to question 2.
I did some more checking of the grid information in the new netCDFs this morning, and while the extent is dramatically different than the old data (and could be incorrect -- hopefully Lindsay's email to Notaro clears this up), the good news is that it does appear that the spatial data the new netCDF contains is internally consistent (e.g., the extent of the grid constructed from the `iy` and `jx` values in that new netCDF matches the extent of the `xlat` and `xlon` values in that netCDF). And based on switching between WGS84 and the grid projection (as defined by the grid mapping variable that Notaro emailed, which matches the grid mapping variable in the previously used, biased netCDFs with the midwestern extent), it does seem like the spatial data is in the same crs.
Here blue dots = corners of the grid constructed from `iy` and `jx` values. Red dots = corners of `xlat` and `xlon` grid values. Green dots and box = bounding points and box of the `xlat` and `xlon` grid (wider than the actual grid because the grid curves in WGS84). Grid = our reconstructed grid (from `iy` and `jx` values):
And if we use the native grid projection:
So based on that, we are confident that our grid reconstruction matches what the debiased (new) netCDF files have for both iy/jx and xlat/xlon. I think the remaining question is: if there is no `rcm_map` attribute in the netCDFs that are on GDP, what does GDP assume the grid is in? Do we need to have David add the `rcm_map` attribute that we received from Notaro to the data on THREDDS? Then, we could retry those same queries and see if there is a change. Either way, it seems like that `rcm_map` variable should end up with the original nc files.
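For reference, a sketch of what attaching that grid-mapping variable to a local copy of one of the nc files could look like with `ncdf4` (the file name and this local-edit workflow are illustrative; the real change would need to happen on the GDP/THREDDS side):
library(ncdf4)
nc <- nc_open("debias_windspeed_mri_mid21_longerdaymet_ghcn_365d.nc", write = TRUE)
# A grid-mapping variable is a dimensionless dummy variable that only carries attributes
rcm_map_def <- ncvar_def("rcm_map", units = "", dim = list(), missval = NULL, prec = "integer")
nc <- ncvar_add(nc, rcm_map_def)
ncatt_put(nc, "rcm_map", "grid_mapping_name", "lambert_conformal_conic")
ncatt_put(nc, "rcm_map", "standard_parallel", c(36, 52))
ncatt_put(nc, "rcm_map", "longitude_of_central_meridian", -97)
ncatt_put(nc, "rcm_map", "latitude_of_projection_origin", 45)
# Point the data variable at the grid mapping per CF conventions
ncatt_put(nc, "windspeed_debias", "grid_mapping", "rcm_map")
nc_close(nc)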
Here is what this looks like when calling GDP to ask for a subset of the underlying data based on the bounding box of the points we give it:
query_gdp <- function(query_simplegeom) {
gcm_job <- geoknife(
stencil = query_simplegeom,
fabric = webdata(
url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_mri",
variables = c("prcp_debias"),
times = c("1981-01-01 00:00:00", "1981-01-02 00:00:00"),
),
knife = webprocess("subset", OUTPUT_TYPE="netcdf")
)
wait(gcm_job)
return(gcm_job)
}
offset <- 0.01;
geo_job <- query_gdp(simplegeom(data.frame(centroid = c(-94.81306, 47.81679), down = c(-94.81306, 47.81679-offset), up = c(-94.81306, 47.81679+offset), left = c(-94.81306-offset, 47.81679), right = c(-94.81306+offset, 47.81679),up_left = c(-94.81306-offset, 47.81679+offset), up_right = c(-94.81306+offset, 47.81679+offset), dwn_left = c(-94.81306-offset, 47.81679-offset), dwn_right = c(-94.81306+offset, 47.81679-offset))))
geoknife::download(geo_job, destination = '~/Downloads/test_subset_notarto.nc')
nc <- ncdf4::nc_open("~/Downloads/test_subset_notarto.nc")
ncvar_get(nc, "xlon")
[,1] [,2] [,3] [,4]
[1,] -95.32296 -95.31609 -95.30916 -95.30218
[2,] -94.98767 -94.97943 -94.97112 -94.96275
[3,] -94.65245 -94.64284 -94.63315 -94.62337
[4,] -94.31731 -94.30633 -94.29525 -94.28409
ncvar_get(nc, "xlat")
[,1] [,2] [,3] [,4]
[1,] 47.48399 47.71057 47.93709 48.16355
[2,] 47.47891 47.70547 47.93196 48.15841
[3,] 47.47290 47.69943 47.92591 48.15233
[4,] 47.46597 47.69247 47.91892 48.14531
~~☝️ wow, the longitude cells are close together. 0.01°~~ whoops, I was thinking about this incorrectly
the data shows NaNs appear in some cells but there isn't a hard edge in x or y. Not sure why these are NaN
ncvar_get(nc, "prcp_debias", start = c(1,1,1,1), count = c(-1,-1,1,-1))
[,1] [,2] [,3] [,4]
[1,] 0 0.000000000 0 0
[2,] 0 0.006079078 NaN NaN
[3,] 0 0.005200386 0 NaN
[4,] 0 0.000000000 0 0
looks like this could be because they are on the edge of a lake?
ncvar_get(nc, "xlat")[2,4]
[1] 48.15841
> ncvar_get(nc, "prcp_debias", start = c(1,1,1,1), count = c(-1,-1,1,-1))[2,4]
[1] NaN
> ncvar_get(nc, "xlon")[2,4]
[1] -94.96275
Ok, I think I'm feeling more confident in what is going on here.
From the netcdf subset, I plotted the values for `Longitude on Cross Points` and `Latitude on Cross Points`. I wasn't sure if those were centroids of the cells or vertices/corners of the cell, but I think our math assumes they are vertices of the cell. I plotted those values as red when they had data and grey when they were NA (as "+" markers).
Then I plotted your "blue point" (cyan here), which falls in the centroid assuming those other points are corners. But then when adding the upper left, up, right, etc from above (grey when those were NA, blue when they had data) it seems likely that our assumed centroid is the lower right corner of the NA data cell (I drew a box below):
So, I'd guess either we've made a mistake assuming the meaning of the cross points (they are centroids instead of corners) or GDP has made a mistake on their meaning (they are corners, but being treated as centroids for data processing).
perhaps the offset needs to include a 0.5 cell width and 0.5 cell height shift?
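In grid-parameter terms, that half-cell interpretation would shift the grid origin like this (a sketch using the values from the tibble above):
# If jx/iy are cell centers (cross points) rather than corners, the polygon grid
# origin sits half a cell back from the minimum center coordinates
cellsize <- 25000
xmin_corner <- -2700000 - cellsize / 2 # -2712500
ymin_corner <- -1750000 - cellsize / 2 # -1762500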
@jread-usgs
What I'm seeing in the underlying data (cells of 0.01°E or 0.3N°) doesn't seem to line up with the cell query map
What do you mean here by 'cells of 0.01°E or 0.3N°' ? That the cells are irregular, or that they aren't square?
I deleted that comment (I think right after I posted it) because I was reading the lat/lon data incorrectly. Is that comment still up here somewhere?
I still see it with a map of the grid cells filled by number of lakes
Ha! So it was just deleted from my view of this issue!? 😆 Good to know. Anyhow, ignore that comment and the one I crossed out above (that was the same incorrect read)
~~☝️ wow, the longitude cells are close together. 0.01°~~ whoops, I was thinking about this incorrectly
I wanted to see where the NAs were in the raw NC files (per the recent Notaro email), but I can't seem to extract any actual values from these nc files. I keep getting an error. When I look up the error, all the resolutions seem to point to a corrupt file. I can get time and xlon/xlat in this same way. I've also double-checked that the variable names I am using match those in each nc file using `names(nc$var)`. Am I doing something wrong?
library(ncdf4)
# Need to be on VPN
lapply(c( "prcp", "tas", "rh", "rsds", "rsdl", "windspeed"), function(var) {
var_url <- sprintf('http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias/files/debias_%s_mri_mid21_longerdaymet_ghcn_365d.nc', var)
nc <- nc_open(var_url)
values_df <- ncvar_get(nc, sprintf("%s_debias", var), start = c(0,0,0,0), count = c(1,1,1,1))
nc_close(nc)
return(values_df)
})
Error in ncvar_get_inner(ncid2use, varid2use, nc$var[[li]]$missval, addOffset, :
C function R_nc4_get_vara_double returned error
I get the same error when I don't specify `start` or `count`.
`start` would need to be `c(1,1,1,1)` for this, since `ncdf4` uses 1-indexing (R indexing) instead of 0
And I think you'd need something like this @lindsayplatt
var <- "prcp"
var_url <- sprintf('http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias/files/debias_%s_mri_mid21_longerdaymet_ghcn_365d.nc', var)
nc <- nc_open(var_url)
values_df <- ncvar_get(nc, sprintf("%s_debias", var), start = c(1,1,1,1), count = c(-1,-1,1,1))
nc_close(nc)
(see the `-1,-1` on the count for the lat/lon slots, allowing you to get all space, but only one year and one day. `-1` means "all")
Ah ok. Still doesn't resolve the error, because I get the error when I don't specify `start` or `count`, as I noted at the end of that comment. Doing that is returning an even scarier failure. Does that code work for you? I can get lat and lon just fine:
lon <- ncdf4::ncvar_get(nc, 'xlon')
lat <- ncdf4::ncvar_get(nc, 'xlat')
Running your code prints this scary HTML output:
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!doctype^ html><html lang="en"><head><title>HTTP Status 400 – Bad Request</title><style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 400 – Bad Request</h1><hr class="line" /><p><b>Type</b> Exception Report</p><p><b>Message</b> Invalid character found in the request target [/thredds/dodsC/notaro_debias/files/debias_prcp_mri_mid21_longerdaymet_ghcn_365d.nc.dods?prcp%5fdebias.prcp%5fdebias[0][0][0:140][0:216] ]. The valid characters are defined in RFC 7230 and RFC 3986</p><p><b>Description</b> The server cannot or will not process the request due to something that is perceived to be a client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing).</p><p><b>Exception</b></p><pre>java.lang.IllegalArgumentException: Invalid character found in the request target [/thredds/dodsC/notaro_debias/files/debias_prcp_mri_mid21_longerdaymet_ghcn_365d.nc.dods?prcp%5fdebias.prcp%5fdebias[0][0][0:140][0:216] ]. The valid characters are defined in RFC 7230 and RFC 3986 org.apache.coyote.http11.Http11InputBuffer.parseRequestLine(Http11InputBuffer.java:509) org.apache.coyote.http11.Http11ProcError in Rsx_nc4_get_vara_double: NetCDF: Access failure
essor.service(Http11Processor.java:511) org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65) org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:831) org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1673) org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) org.apache.tomcat.util.threads.ThreadPoolExecutorVar: prcp_debias Ndims: 4 Start: $Worker.run(ThreadPoolExecutor.java:659) org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) java.lang.Thread.run(Thread.java:748)</pre><p><b>Note</b> The full stack trace of the root cause is available in the server logs.</p><hr class="line" /><h3>Apache Tomcat</h3></body></html>
0,0,0,0 Count: 1,1,141,217
Error in ncvar_get_inner(ncid2use, varid2use, nc$var[[li]]$missval, addOffset, :
C function R_nc4_get_vara_double returned error
huh, yeah, it seemed to work for me:
I haven't used the direct URL method to get more complicated data like this before. Lat/lon are simpler because they don't have other dimensions and probably don't have offsets/compression. Maybe the other variables are causing issues due to that?
Updating `ncdf4` from 1.18 to 1.19 to see if that fixes it
That did not fix it. Le sigh, so much for a quick check of the raw nc file
I adjusted the grid so that we assume the points in the NetCDF are the cell centroids and then tried querying for those missing cells again (plus some nearby cells). The centroid of the cell and the centroid of the lakes now return the same result (so I can assume this means we have appropriately reconstructed the grid locally!):
library(targets)
library(sf)
library(dplyr)
na_cells <- c(17909, 18126, 18127, 18993)
nearby_cells <- c(18128, 18343, 18344)
all_cells <- c(na_cells, nearby_cells)
query_lakeids <- tar_read(lake_cell_xwalk_df) %>% filter(cell_no %in% all_cells)
cell_geom <- tar_read(grid_cells_sf) %>%
filter(cell_no %in% all_cells) %>%
st_transform(4326)
cell_centroid <- tar_read(grid_cell_centroids_sf) %>%
filter(cell_no %in% all_cells) %>%
st_transform(4326)
# Get centroid for each of these actual lakes
lake_centroids <- readRDS("2_crosswalk_munge/out/centroid_lakes_sf.rds") %>%
filter(site_id %in% query_lakeids$site_id) %>%
left_join(select(query_lakeids, site_id, cell_no)) %>%
st_zm() %>%
st_centroid()
# Find centroid of the lake centroids per cell
centroid_of_cell_lakes <- lake_centroids %>%
bind_cols(as.data.frame(st_coordinates(lake_centroids))) %>%
st_drop_geometry() %>%
group_by(cell_no) %>%
summarize(X = mean(X), Y = mean(Y)) %>%
st_as_sf(coords = c("X", "Y"), crs = 4326)
# Construct the grid before the fix to show on the plot
original_cell_geom <- (tar_read(grid_cells_sf) %>%
filter(cell_no %in% all_cells) + 25000/2) %>%
st_as_sf(crs = "+proj=lcc +lat_0=45 + lon_0=-97 +lat_1=36 +lat_2=52 +x_0=0 +y_0=0 +ellps=WGS84 +units=m") %>%
st_transform(4326)
plot(st_geometry(cell_geom))
plot(st_geometry(lake_centroids), add=T)
plot(st_geometry(centroid_of_cell_lakes), add=T, col="red", pch=16, cex=2)
plot(st_geometry(cell_centroid), add=T, col="blue", pch=16, cex=2)
plot(st_geometry(original_cell_geom), add=TRUE, border = "grey", lwd=2)
Here is what I did to test the GDP data of these cells
library(geoknife)
library(tidyverse)
library(sf)
query_gdp <- function(query_simplegeom) {
gcm_job <- geoknife(
stencil = query_simplegeom,
fabric = webdata(
url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_ipsl",
variables = c("prcp_debias"),# "tas_debias", "rh_debias", "rsds_debias", "rsdl_debias", "windspeed_debias"),
times = c("1981-01-01 00:00:00", "1981-02-01 00:00:00")
)
# The default knife algorithm is appropriate
)
wait(gcm_job)
data_out <- result(gcm_job, with.units = TRUE)
return(data_out)
}
sf_pts_to_simplegeom <- function(sf_obj, id_col = "cell_no") {
sf_obj %>%
st_coordinates() %>%
t() %>% as.data.frame() %>%
# Include the cell numbers in the GDP geom so that the
# results can be linked to the correct cells
setNames(sf_obj[[id_col]]) %>%
simplegeom()
}
cell_centroids_data <- query_gdp(sf_pts_to_simplegeom(cell_centroid)) %>%
mutate(date = as.Date(DateTime) + `time(day of year)`) %>%
select(date, all_of(as.character(seq_along(cell_centroid$cell_no))))
names(cell_centroids_data) <- c("date", cell_centroid$cell_no)
centroid_of_cell_lakes_data <- query_gdp(sf_pts_to_simplegeom(centroid_of_cell_lakes)) %>%
mutate(date = as.Date(DateTime) + `time(day of year)`) %>%
select(date, all_of(as.character(seq_along(centroid_of_cell_lakes$cell_no))))
names(centroid_of_cell_lakes_data) <- c("date", centroid_of_cell_lakes$cell_no)
cell_centroids_data_summary <- cell_centroids_data %>%
pivot_longer(-date, names_to = "cell_no") %>%
mutate(cell_no = as.numeric(cell_no)) %>%
group_by(cell_no) %>%
summarize(n = n(),
n_na = sum(is.na(value)),
percent_na = n_na/n)
cell_centroids_data_summary_sf <- cell_centroid %>% left_join(cell_centroids_data_summary)
centroid_of_cell_lakes_data_summary <- centroid_of_cell_lakes_data %>%
pivot_longer(-date, names_to = "cell_no") %>%
mutate(cell_no = as.numeric(cell_no)) %>%
group_by(cell_no) %>%
summarize(n = n(),
n_na = sum(is.na(value)),
percent_na = n_na/n)
centroid_of_cell_lakes_data_summary_sf <- centroid_of_cell_lakes %>% left_join(centroid_of_cell_lakes_data_summary)
ggplot() +
geom_sf(data = cell_centroids_data_summary_sf, aes(color = percent_na), size=2) +
geom_sf(data = centroid_of_cell_lakes_data_summary_sf, aes(color = percent_na), size=2) +
scico::scale_fill_scico(palette = "batlow", direction = -1) +
geom_sf(data = cell_geom, fill=NA, size = 1) +
# geom_sf_label(data = cell_geom, aes(label = cell_no)) +
theme_void()
# Values for centroid of cell & centroid of lakes within the cell give the same exact results!
# Choosing a cell that has data and one that doesn't have any
> centroid_of_cell_lakes_data %>% select(`17909`) %>% summary()
17909
Min. : 0.0000
1st Qu.: 0.0000
Median : 0.4505
Mean : 2.4753
3rd Qu.: 1.4072
Max. :78.8974
> cell_centroids_data %>% select(`17909`) %>% summary()
17909
Min. : 0.0000
1st Qu.: 0.0000
Median : 0.4505
Mean : 2.4753
3rd Qu.: 1.4072
Max. :78.8974
> centroid_of_cell_lakes_data %>% select(`18344`) %>% summary()
18344
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :365
> cell_centroids_data %>% select(`18344`) %>% summary()
18344
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :365
Plot of percent NAs returned by either the centroid of the cell or centroid of the lake.
A look into NA values for grid cells before and after the grid-shift fix. It seems to have added more missing data than before?
library(targets)
library(tidyverse)
library(sf)
library(geoknife)
# Function to download one month of data for all MN cells
download_and_plot_data <- function(mn_centroids_sf, mn_cells_sf, title) {
query_gdp <- function(query_simplegeom) {
gcm_job <- geoknife(
stencil = query_simplegeom,
fabric = webdata(
url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_ipsl",
variables = c("prcp_debias", "tas_debias", "rh_debias", "rsds_debias", "rsdl_debias", "windspeed_debias"),
times = c("1981-01-01 00:00:00", "1981-02-01 00:00:00")
)
# The default knife algorithm is appropriate
)
wait(gcm_job)
data_out <- result(gcm_job, with.units = TRUE)
return(data_out)
}
# Call this inside to overwrite the function loaded by the `source()` call for `get_query_cells()`
sf_pts_to_simplegeom <- function(sf_obj, id_col = "cell_no") {
sf_obj %>%
st_coordinates() %>%
t() %>% as.data.frame() %>%
# Include the cell numbers in the GDP geom so that the
# results can be linked to the correct cells
setNames(sf_obj[[id_col]]) %>%
simplegeom()
}
all_MN_cell_data <- query_gdp(sf_pts_to_simplegeom(mn_centroids_sf, id_col = "site_id"))
all_MN_cell_data_formatted <- all_MN_cell_data %>%
select(DateTime, all_of(as.character(seq_along(mn_centroids_sf$cell_no))))
names(all_MN_cell_data_formatted) <- c("DateTime", mn_centroids_sf$cell_no)
# Summarize the month by counting how many NAs
all_MN_cell_data_summary <- all_MN_cell_data_formatted %>%
pivot_longer(-DateTime, names_to = "cell_no") %>%
group_by(cell_no) %>%
summarize(na_count = sum(is.na(value)),
percent_na = na_count/n()) %>%
mutate(cell_no = as.numeric(cell_no))
all_MN_cell_data_summary_sf <- mn_cells_sf %>%
left_join(all_MN_cell_data_summary, by = "cell_no")
mn_sf <- st_as_sf(maps::map('state', 'minnesota', fill=TRUE, plot=FALSE))
print(ggplot() +
geom_sf(data = all_MN_cell_data_summary_sf, aes(fill = percent_na)) +
scico::scale_fill_scico(palette = "batlow", direction = -1) +
geom_sf(data = mn_sf, fill=NA, size = 1) +
theme_void()+
ggtitle(title))
}
# Construct the grid before the fix to show on the plot
source("7_drivers_munge/src/GCM_driver_utils.R") # Load `get_query_cells()`
original_cell_geom <- (tar_read(grid_cells_sf) + 25000/2) %>%
st_as_sf(crs = "+proj=lcc +lat_0=45 + lon_0=-97 +lat_1=36 +lat_2=52 +x_0=0 +y_0=0 +ellps=WGS84 +units=m")
all_MN_cell_nos_orig <- get_query_cells(original_cell_geom, tar_read(query_lake_centroids_sf)) %>% pull(cell_no)
all_MN_cells_orig <- original_cell_geom %>% filter(cell_no %in% all_MN_cell_nos_orig) %>% st_transform(4326)
all_MN_cell_centroids_orig <- st_centroid(all_MN_cells_orig)
download_and_plot_data(all_MN_cell_centroids_orig, all_MN_cells_orig, "Before fix")
# Construct the grid with the current fix
all_MN_cells <- tar_read(grid_cells_sf) %>% filter(cell_no %in% tar_read(query_cells)) %>% st_transform(4326)
all_MN_cell_centroids <- st_centroid(all_MN_cells)
download_and_plot_data(all_MN_cell_centroids, all_MN_cells, "After fix")
The difference in missing cells does make sense given where the lakes are and how the new grid lines up:
mn_sf <- st_as_sf(maps::map('state', 'minnesota', fill=TRUE, plot=FALSE))
lake_centroids <- readRDS("2_crosswalk_munge/out/centroid_lakes_sf.rds") %>%
st_intersection(mn_sf)
# BEFORE THE FIX
plot(st_geometry(mn_sf), main = "Before fix")
plot(st_geometry(lake_centroids), add=T, col="grey")
plot(st_geometry(all_MN_cells_orig), add=T)
# AFTER THE FIX
plot(st_geometry(mn_sf), main="After fix")
plot(st_geometry(lake_centroids), add=T, col="grey")
plot(st_geometry(all_MN_cells), add=T)
Also matches what the raw gridded data show - this is from the subset Jordan pulled with:
nccopy http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_mri?time[0:1:1],year[0:1:1],jx[0:1:216],iy[0:1:140],prcp_debias[0:1:1][0:1:1][0:1:140][0:1:216] TOSS_NETCDF_NOTARO.nc
I am going to pick up here with dates next week (this happens in the munge step, so not difficult to do post-download).
# What is probably happening is that the GCMs don't acknowledge
# leap years but when we turn `DateTime` & `time(day of year)`
# into a date, it thinks there are leap days, which is why each
# year after a leap year is one day short.
munged_data <- tar_read(glm_ready_gcm_data_feather)[1] %>%
arrow::read_feather() %>%
# Use only one cell to investigate date issues
filter(cell == 13782) %>%
mutate(is_leapyear = lubridate::leap_year(time),
year = as.numeric(format(time, "%Y")))
munged_data %>%
mutate(mon_day = format(time, "%m-%d")) %>%
filter(mon_day == "02-29")
munged_data %>%
group_by(year) %>%
summarize(n = n(),
is_leap = unique(is_leapyear))
# A tibble: 60 x 3
year n is_leap
<dbl> <int> <lgl>
1 1981 365 FALSE
2 1982 365 FALSE
3 1983 365 FALSE
4 1984 366 TRUE
5 1985 364 FALSE
6 1986 365 FALSE
7 1987 365 FALSE
8 1988 366 TRUE
9 1989 364 FALSE
10 1990 365 FALSE
# ... with 50 more rows
The leap year issue is more complicated than I thought. For some reason, the `DateTime` column has bizarre times, and 1985 has a start date of 1984-12-31 23:15:03. Not sure why these times are here? I had been assuming the times were all 00:00:00. I was going to use the values of `DateTime` to figure out if it was a leap year or not, but it feels a bit weird to do that when the start of 1985 is actually 1984-12-31 23:15:03.
Here are all the unique ones that we add `time(day of year)` to in order to get the actual date.
Hmm, that is odd. That `DateTime` field is returned by GDP and then saved in the raw feather files? I'm just wondering where that comes from, since the raw netCDF files only have a `time` field, that is day of year, and a `year` field:
Yes, the `DateTime` column is what is returned from GDP.
I have a solution but wowza ... it was pretty complicated. I end up with 365 records per year, which is exactly what we want
Just wanted to weigh in on the "DateTime comes from GDP" point. This is the raw result from GDP. Perhaps something is funky with `result()` if you are getting those bad dates?
That's what I get for the first year, too. What does your date next to 1984 look like?
Yuck!
Yeah :(
Redid the plots from https://github.com/USGS-R/lake-temperature-model-prep/issues/273#issuecomment-1027320572 with the new grid + my fix for leap year dates. However, the Winslow data and these still treated leap years differently so there was a mismatch but ONLY for leap years. I did the set of plots with all the dates included and then one where I removed any data from leap years. Feeling confident about the grid we've got going on now.
10% 50% 90%
-0.2699306 0.0000000 0.0000000
10% 50% 90%
0 0 0
Hmmm, although looking at the mismatch between my fixed data and the Winslow data shows that my fix isn't quite working for all the years. I now have Dec 31 for all leap years in our dataset, BUT we shouldn't have NAs for 03-01, only for 02-29. A bit more tweaking needed ...
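One possible way to get there (a sketch, not the actual fix in the pipeline): treat `time(day of year)` as an index into a fixed 365-day calendar so that Feb 29 is never produced, which leaves NAs only on 02-29 when joining against a real-calendar dataset like Winslow's.
# Map a 0-based day-of-year from a 365-day (noleap) calendar onto real dates, skipping Feb 29 entirely
noleap_date <- function(year, doy0) {
# Month-day template from an arbitrary non-leap year
template <- format(seq(as.Date("1995-01-01"), as.Date("1995-12-31"), by = "1 day"), "%m-%d")
as.Date(sprintf("%s-%s", year, template[doy0 + 1]))
}
noleap_date(1984, c(58, 59, 60)) # "1984-02-28" "1984-03-01" "1984-03-02" (no Feb 29)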
David has those available now on THREDDS. This fixes the data but also adds longwave radiation, which will need to be added as a new variable in our workflow.
The `glm_ready_gcm_data_feather` files sometimes have `NA`s and sometimes have `NaN`s. How come? This causes an issue with building our NetCDF file, so we added a fix that converts `NA`s to `NaN`s to get put into the NetCDF file.
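For reference, a minimal sketch of the kind of NA-to-NaN conversion described (the data frame name is illustrative):
library(dplyr)
glm_ready <- glm_ready %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.x), NaN, .x)))
# is.na(NaN) is TRUE, so existing NaNs stay NaN and NAs become NaN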