DOI-USGS / lake-temperature-model-prep


Switch to debiased GCMs #273

Closed lindsayplatt closed 2 years ago

lindsayplatt commented 2 years ago

David has those available now on THREDDS. This fixes the data but also adds longwave radiation, which will need to be added as a new variable in our workflow.

lindsayplatt commented 2 years ago

Here is where the new NC files are for this data. The variables available here are already formatted and in the units we would use for GLM, which makes the recent unit conversion effort a bit moot. See

lindsayplatt commented 2 years ago

Don't use the snow variable - keep calculating it ourselves, because that one looks like snow cover. Will need to convert precip from mm/day to m/day. Also, this might be daily data (!!!), which will hopefully translate into performance improvements for #258
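
For reference, a minimal sketch of that precip conversion in the munge step (the column name and example values below are hypothetical, just to show the mm/day to m/day step):

library(tidyverse)

# Hypothetical stand-in for a downloaded GCM result (prcp_debias comes back in mm/day)
gcm_data <- tibble(time = as.Date("1981-01-01") + 0:2,
                   prcp_debias = c(0, 0.023, 3.368))

# GLM wants precip in m/day, so divide the mm/day values by 1000
gcm_data <- mutate(gcm_data, Rain_m_per_day = prcp_debias / 1000)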

lindsayplatt commented 2 years ago

Grid seems to be slightly different from the other one because my query cell is not returning any data. I can get data if I ask for just WI, so we might just need to make slight adjustments. David suggested looking at https://github.com/USGS-R/necsc-lake-modeling for help to get the new grid right.

library(geoknife)

gcm_job <- geoknife(
  stencil = webgeom('state::Wisconsin'),
  fabric = webdata(
    url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_mri",
    variables = c(
      'windspeed_debias', 
      'tas_debias', # air temp
      'rsds_debias', #shortwave radiation
      'rsdl_debias', # longwave radiation
      'rh_debias', # relative humidity
      'prcp_debias'), # precip
    times = c('2040-01-01', '2040-01-10')
  )
  # The default knife algorithm is appropriate
)
wait(gcm_job)
gcm_job_status <- check(gcm_job)$statusType

my_data <- result(gcm_job, with.units = TRUE)
lindsayplatt commented 2 years ago

This one and #258 are kind of blurring together now because these new debiased GCMs are daily data, which makes this a much faster download process. I was having some trouble getting our reconstructed grid to work just yet, so I did some testing with just a grid over WI.

First, I wanted to test download times for the different GCMs and different numbers of cells per query. This first test only varied the GCM and the number of cells per query (not the time periods), BUT even multiplying these times by 3 to approximate all 3 projection periods puts us way ahead of the 16 hours from my test with 5 lakes (4 different queries per GCM + projection period). Also, there doesn't seem to be much difference based on the number of cells per query.

The function to run tests through

# This function saves two files - the actual data file (wanted to run times to reflect this step) and
# the run results with three columns `gcm`, `ncells`, and `dur_sec`. I save this out just in case one
# fails so I can still look at the results and not have a bunch of time wasted.
run_test <- function(geom_sf, ncells, gcm, projection_period) {

  set.seed(18)
  query_cells_sf <- geom_sf[sample(1:nrow(geom_sf), ncells),]

  sf_pts_to_simplegeom <- function(sf_obj) {
    sf_obj %>%
      st_coordinates() %>%
      t() %>% as.data.frame() %>%
      # Include the cell numbers in the GDP geom so that the
      # results can be linked to the correct cells
      setNames(sf_obj$cell_no) %>%
      simplegeom()
  }
  query_geom_WGS84 <- sf::st_transform(query_cells_sf, crs = 4326)
  query_simplegeom <- sf_pts_to_simplegeom(query_geom_WGS84)

  query_dates <- switch(
    as.character(projection_period),
    '1980_1999' = c('1980-01-01', '1999-12-31'),
    '1980_2099' = c('1980-01-01', '2099-12-31')
  )

  fp <- tryCatch({
    query_stats <- system.time({
      gcm_job <- geoknife(
        stencil = query_simplegeom,
        fabric = webdata(
          url = sprintf("http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_%s", gcm),
          variables = c(
            'windspeed_debias', 
            'tas_debias', # air temp
            'rsds_debias', #shortwave radiation
            'rsdl_debias', # longwave radiation
            'rh_debias', # relative humidity
            'prcp_debias'), # precip
          times = query_dates
        )
        # The default knife algorithm is appropriate
      )
      wait(gcm_job)
      gcm_job_status <- check(gcm_job)$statusType

      my_data <- result(gcm_job, with.units = TRUE)
      out_file <- sprintf("7_drivers_munge/test/new_%s_%s_%s.feather", gcm, projection_period, ncells)
      arrow::write_feather(my_data, out_file)
    })
    dur_sec <- query_stats[["elapsed"]]
    fp <- sprintf('7_drivers_munge/test/new_run_results_%s_%s_%s.feather', gcm, projection_period, ncells)
    arrow::write_feather(tibble(gcm, ncells, dur_sec), fp)
    return(fp)
  }, error = function(e) return(""))

  return(fp)
}

Test 1: Comparing times across GCMs and number of cells for just the 1980-1999 period

library(tidyverse)
library(sf)
library(geoknife)

# Test grid over all of WI 
wi_sf <- maps::map('state', 'wi', plot=FALSE, fill=TRUE) %>% st_as_sfc()
wi_grid_sfc <- st_make_grid(wi_sf)
wi_grid_sf <- st_sf(data.frame(cell_no = 1:length(wi_grid_sfc)), geometry=wi_grid_sfc)
wi_grid_centroids <- st_centroid(wi_grid_sf)

# Set up the different tests
test_criteria <- expand.grid(
  ncells = c(1,5,10,20), 
  gcm = c('access', 'cnrm', 'gfdl', 'ipsl', 'mri', 'miroc5'),
  projection = c('1980_1999'))

# Run all the tests (saves those two files per run)
pmap_dfr(test_criteria, run_test, geom_sf = wi_grid_centroids)

# There were a few issues with errors near the end that had me re-run just
# a couple of the tests by subsetting `test_criteria` in the above command
# and re-running.

# Save out a single file summarizing all the run results
purrr::map(list.files('7_drivers_munge/test', 'new_run_results_', full.names = T), 
           arrow::read_feather) %>% 
  bind_rows() %>% 
  write_csv("7_drivers_munge/test/new_run_results.csv")

Then, we can look at the results (using the filename pattern)

test_results1 <- read_csv("7_drivers_munge/test/new_run_results.csv") 

ggplot(test_results1, aes(ncells, dur_sec)) + 
  geom_point(size = 2) + 
  facet_grid(gcm~.)

# Total of ~ 75 min
(test_results1 %>% pull(dur_sec) %>% sum())/60

# Capturing the 24 rows of results here for reproducibility
as.data.frame(test_results1)
      gcm ncells dur_sec
1  access      1   39.46
2  access     10   91.69
3  access     20  241.49
4  access      5   68.68
5    cnrm      1  161.36
6    cnrm     10  174.69
7    cnrm     20  280.71
8    cnrm      5  210.16
9    gfdl      1  167.92
10   gfdl     10  167.36
11   gfdl     20  278.96
12   gfdl      5  217.14
13   ipsl      1  169.46
14   ipsl     10  165.11
15   ipsl     20  247.40
16   ipsl      5  213.89
17 miroc5      1  172.96
18 miroc5     10  180.81
19 miroc5     20  273.67
20 miroc5      5   80.12
21    mri      1  167.31
22    mri     10  179.22
23    mri     20  250.75
24    mri      5  214.74

image

Test 2: Try the full time period (these are all able to be queried in one call with the new GCMs now!)

Want to compare what the time would be like to pull from 1980-2099 as opposed to doing each separately.

# Use the same spatial setup as before

library(tidyverse)
library(sf)
library(geoknife)

# Test grid over all of WI 
wi_sf <- maps::map('state', 'wi', plot=FALSE, fill=TRUE) %>% st_as_sfc()
wi_grid_sfc <- st_make_grid(wi_sf)
wi_grid_sf <- st_sf(data.frame(cell_no = 1:length(wi_grid_sfc)), geometry=wi_grid_sfc)
wi_grid_centroids <- st_centroid(wi_grid_sf)

# Set up the different tests
test_criteria2 <- expand.grid(
  ncells = c(1,5,10,20), 
  gcm = c('cnrm'),
  projection_period = c('1980_2099'))

# Run the tests
test_results2 <- pmap_dfr(test_criteria2, run_test, geom_sf = wi_grid_centroids)

# Write out the results
purrr::map(list.files('7_drivers_munge/test', 'new_run_results_.*_1980_2099', full.names = T), 
           arrow::read_feather) %>% 
  bind_rows() %>% 
  write_csv("7_drivers_munge/test/new_run_results_full_period.csv")

Now, check out the results and compare to the 1980-1999 times (TBD ... waiting for build but don't want to lose the stuff above)

# File sizes are definitely getting big with the full time period:
filemetadata <- as.data.frame(file.info(list.files('7_drivers_munge/test', 'new_cnrm_1980_2099', full.names=T)))
filemetadata <- filemetadata %>% 
  mutate(path = rownames(filemetadata)) %>% 
  mutate(size_mb = round(size / 10^6, 2)) %>% 
  select(path, size_mb) %>% arrange(size_mb)
rownames(filemetadata) <- NULL 
filemetadata

                                                path size_mb
1  7_drivers_munge/test/new_cnrm_1980_2099_1.feather    2.57
2  7_drivers_munge/test/new_cnrm_1980_2099_5.feather    5.44
3 7_drivers_munge/test/new_cnrm_1980_2099_10.feather    8.33
4 7_drivers_munge/test/new_cnrm_1980_2099_20.feather   14.05

# Load the result files and combine with the previous results (but only keep the GCM that matches)
test_results2 <- read_csv("7_drivers_munge/test/new_run_results_full_period.csv") 
test_results_together <- test_results2 %>% 
  mutate(query_dates = '1980_2099') %>%
  bind_rows(mutate(test_results1, query_dates = '1980_1999')) %>% 
  filter(gcm == "cnrm")

ggplot(test_results_together, aes(ncells, dur_sec, color = query_dates)) + 
  geom_point(size = 2) + 
  facet_grid(gcm~.)

test_results_compare <- test_results_together %>% 
  pivot_wider(names_from = query_dates, values_from = dur_sec) %>% 
  arrange(ncells) %>% 
  mutate(relationship = round(`1980_2099` / `1980_1999`, 2))

test_results_compare 

# A tibble: 4 x 5
  gcm   ncells `1980_2099` `1980_1999` relationship
  <chr>  <dbl>       <dbl>       <dbl>        <dbl>
1 cnrm       1        133.        161.         0.83
2 cnrm       5        571.        210.         2.72
3 cnrm      10        560.        175.         3.21
4 cnrm      20        543.        281.         1.93

# It took 30 min to run 4 queries on the full time period.
round(sum(test_results_compare$`1980_2099`)/60)

# It took 14 min to run 4 queries on the first projection period alone
round(sum(test_results_compare$`1980_1999`)/60)

image

Some thoughts

Running the full time period (1980-2099) takes about double the time of running just one of the three time periods alone. This is great because it means that running the full time period at once would take ~2/3 the time of running each period individually. The downside is that file sizes are much larger (the feather file for 20 cells with only the first period is 4.3 MB, but the feather file for 20 cells with the full period is 13.4 MB). As we scale to more lakes/cells, this could be a problem.
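
To make that arithmetic concrete, a rough back-of-envelope using the measured cnrm 20-cell durations from the tables above:

one_period  <- 280.71  # seconds for the 1980_1999 query (cnrm, 20 cells)
full_period <- 543     # seconds for the 1980_2099 query (cnrm, 20 cells), ~1.9x one period

# Strategy A: three separate period queries vs. Strategy B: one full-period query
c(three_queries = 3 * one_period,
  one_query = full_period,
  fraction = full_period / (3 * one_period))  # roughly 2/3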

lindsayplatt commented 2 years ago

Some weirdness with the dates:

library(geoknife)
gcm_job <- geoknife(
  stencil = webgeom('state::Wisconsin'),
  fabric = webdata(
    url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_mri",
    variables = c(
      'windspeed_debias', 
      'tas_debias', # air temp
      'rsds_debias', #shortwave radiation
      'rsdl_debias', # longwave radiation
      'rh_debias', # relative humidity
      'prcp_debias'), # precip
    times = c('1980-01-01', '2099-12-31')
  )
  # The default knife algorithm is appropriate
)
wait(gcm_job)
gcm_job_status <- check(gcm_job)$statusType

my_data <- result(gcm_job, with.units = TRUE)

summary(my_data$DateTime)

                 Min.               1st Qu.                Median 
"1981-01-01 00:00:00" "1995-10-02 07:44:17" "2049-07-01 19:13:16" 
                 Mean               3rd Qu.                  Max. 
"2043-03-02 19:13:16" "2084-04-01 06:42:14" "2098-12-31 14:26:33"

The min is weirdly Jan 1 1981 and the max is Dec 31 2098. I was expecting 1980 as the min and 2099 as the max.

jordansread commented 2 years ago

I think if you leave times as NA you would get the whole time range of the dataset, but I'm also surprised by your result here.
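
If that's the route we take, it would presumably look something like this (untested sketch; `times()` should come back as NAs when no time subset is requested):

library(geoknife)

# Sketch: omit `times` so the fabric defaults to the dataset's full time range
fabric_all_time <- webdata(
  url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_mri",
  variables = c('tas_debias', 'prcp_debias')
)
times(fabric_all_time)  # expect NA NA, i.e. no time subsetting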

lindsayplatt commented 2 years ago

Figuring out the new grid params:

nc <- ncdf4::nc_open('http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias/files/debias_windspeed_mri_mid21_longerdaymet_ghcn_365d.nc') 

unique(diff(ncdf4::ncvar_get(nc, 'jx'))) # For cellsize
unique(diff(ncdf4::ncvar_get(nc, 'iy'))) # Confirming cellsize

min(ncdf4::ncvar_get(nc, 'jx')) # For xmin
min(ncdf4::ncvar_get(nc, 'iy')) # For ymin
length(ncdf4::ncvar_get(nc, 'jx')) # For nx
length(ncdf4::ncvar_get(nc, 'iy')) # For ny

Using those results, I will be editing the grid params to this tibble:

tibble(
    crs = "+proj=lcc +lat_0=45 + lon_0=-97 +lat_1=36 +lat_2=52 +x_0=0 +y_0=0 +ellps=WGS84 +units=m",
    cellsize = 25000,
    xmin = -2700000,
    ymin = -1750000,
    nx = 217,
    ny = 141
)
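
For context, here is a rough sketch (not the pipeline's actual target code) of how a polygon grid could be reconstructed from those parameters, assuming the jx/iy values mark cell corners (an assumption that gets revisited later in this thread); the cell numbering here is illustrative only:

library(sf)

grid_params <- list(
  crs = "+proj=lcc +lat_0=45 +lon_0=-97 +lat_1=36 +lat_2=52 +x_0=0 +y_0=0 +ellps=WGS84 +units=m",
  cellsize = 25000, xmin = -2700000, ymin = -1750000, nx = 217, ny = 141
)

# Bounding box = lower-left corner plus nx/ny cells of 25 km
grid_bbox <- st_bbox(
  c(xmin = grid_params$xmin,
    ymin = grid_params$ymin,
    xmax = grid_params$xmin + grid_params$nx * grid_params$cellsize,
    ymax = grid_params$ymin + grid_params$ny * grid_params$cellsize),
  crs = st_crs(grid_params$crs))

grid_sfc <- st_make_grid(st_as_sfc(grid_bbox), cellsize = grid_params$cellsize)
grid_sf <- st_sf(cell_no = seq_along(grid_sfc), geometry = grid_sfc)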
lindsayplatt commented 2 years ago

I think the DateTime column is the first day of the year and then you have to add the time (day of year) column to it to get the real dates ...

head(raw_data)
    DateTime      16188     16402    variable statistic time(day of year)
1 1981-01-01 0.00000000 0.0000000 prcp_debias      MEAN                 0
2 1981-01-01 0.00000000 0.0230931 prcp_debias      MEAN                 1
3 1981-01-01 3.36814360 3.8792896 prcp_debias      MEAN                 2
4 1981-01-01 1.70109210 2.6726415 prcp_debias      MEAN                 3
5 1981-01-01 0.01876353 0.5939542 prcp_debias      MEAN                 4
6 1981-01-01 1.41612720 0.7077347 prcp_debias      MEAN                 5

> length(which(as.Date(raw_data$DateTime) == "1981-01-01"))
[1] 2190
> 2190/6 # How many Jan 1 1981s per GCM?
[1] 365
lindsayplatt commented 2 years ago

Yeah ... that is definitely the case:

raw_data %>% 
    mutate(date = as.Date(DateTime) + `time(day of year)`) %>% 
    select(DateTime, `time(day of year)`, date) %>% 
    summary()

   DateTime                   time(day of year)      date           
 Min.   :1981-01-01 00:00:00   Min.   :  0       Min.   :1981-01-01  
 1st Qu.:1995-10-02 07:44:17   1st Qu.: 91       1st Qu.:1995-12-31  
 Median :2049-07-01 19:13:16   Median :182       Median :2049-12-30  
 Mean   :2043-03-02 19:13:16   Mean   :182       Mean   :2043-08-31  
 3rd Qu.:2084-04-01 06:42:14   3rd Qu.:273       3rd Qu.:2084-12-29  
 Max.   :2098-12-31 14:26:33   Max.   :364       Max.   :2099-12-30  

Not sure why we aren't getting 2099-12-31 as the max date though? Is that a timezone thing?

lindsayplatt commented 2 years ago

I was just able to run the pipeline accessing the new GCMs. For two tiles x 6 GCMs x all time at once, there were 12 total queries, and that took 60 min. The longest query took 7 min, which is a far cry from my test where the longest individual query took 40 min (not to mention that the number of queries is cut by a factor of 3 thanks to doing all time periods at once).

# Used the function from this comment to run these
# https://github.com/USGS-R/lake-temperature-model-prep/issues/258#issuecomment-999763969

> # How long did the GDP download take and how big are the resulting files?
> tar_load(gcm_data_raw_feather)
> target_metadata_download <- summarize_target_build(names(gcm_data_raw_feather))
-- DURATION --
Shortest: 3.106 min
Longest: 7.651 min
Total: 60.677 min

-- FILE SIZES --
Smallest: 3.515 mb
Biggest: 3.562 mb
Total: 42.517 mb

> 
> # How long did the munge take and how big are the resulting files?
> tar_load(glm_ready_gcm_data_feather)
> target_metadata_munge <- summarize_target_build(names(glm_ready_gcm_data_feather))
-- DURATION --
Shortest: 0.012 min
Longest: 0.132 min
Total: 0.288 min

-- FILE SIZES --
Smallest: 1.992 mb
Biggest: 2.038 mb
Total: 24.23 mb
lindsayplatt commented 2 years ago

Running this using a rudimentary match to MN just to do a slight scaling up. Will see what results are on Monday! 9 tiles with 363 cells total. The total number of lakes is 14,721.

tar_load(query_cells_centroids_list_by_tile)
lapply(query_cells_centroids_list_by_tile, nrow) %>% unlist() %>% unname()
69  24   3  12 100  26  17  69  43

lake_cell_xwalk_df %>% group_by(cell_no) %>% summarize(n = n()) %>% arrange(desc(n))
# A tibble: 363 x 2
   cell_no     n
     <int> <int>
 1   17696   238
 2   16387   227
 3   16610   202
 4   18137   189
 5   16604   187
 6   17042   181
 7   16821   176
 8   15954   172
 9   17255   163
10   18138   161
# ... with 353 more rows
lindsayplatt commented 2 years ago

Using the more sophisticated method from #278, there are 14,678 MN lakes (which is actually slightly fewer than my previous approach). Note that the query_cells_centroids_list_by_tile is the same, so it doesn't change the cells we are querying.

lindsayplatt commented 2 years ago

Currently running the downloads with the GDP fix noted here. For MN, we have 9 different tiles with varying numbers of cells in each. Each tile = the spatial limits for one GDP query. With 9 tiles and 6 different GCMs (pulling all dates at once), we end up with 54 total queries to GDP.

# Merge lake counts per cell with the tile and cell centroid spatial info
library(targets)
library(sf)
library(tidyverse)

tar_load(query_cells_info_df)
tar_load(query_cells_centroids_list_by_tile)

tile_list_w_counts <- purrr::map(query_cells_centroids_list_by_tile, function(x) {
  left_join(x, dplyr::select(query_cells_info_df, -x, -y), by = "cell_no")
}) %>% bind_rows()

# Make some maps
# 9 different tiles * 6 GCMs = 54 total queries for all of MN
tile_list_w_counts %>% 
  dplyr::mutate(tile_no = as.character(tile_no)) %>%
  ggplot(aes(color = tile_no)) + geom_sf(size = 2) + theme_void() +
  ggtitle("Cell centroids by tile (tile = one GDP query)")

tile_list_w_counts %>% 
  ggplot(aes(color = n_lakes)) + geom_sf(size = 2) + 
  scico::scale_color_scico(palette = 'batlow', direction = -1) + 
  theme_void()+
  ggtitle("Cell centroids by number of lakes per cell")

image

image

jordansread commented 2 years ago

I think my comment is directed at the future multi-state pull, since it is probably unnecessary optimization if aimed at MN (unless you want to try to see about time-costs of the queries).

Given this

9 tiles with 363 cells total

I'd think all of those cells could be snagged in a single bigger tile (or two, since you don't really want your cell layout to be optimized only for the state of MN...), reducing your number of GDP queries by 9x. Unless you get into the territory of 1000 cells in a query, you should still be able to do these in one call. Unless the pivot table operations are the limiting factor for the number of cells.
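
A hedged sketch of what the single-stencil version could look like, reusing the `sf_pts_to_simplegeom()` helper and the `query_cells_centroids_list_by_tile` target already named in this thread:

library(targets)
library(sf)
library(geoknife)
library(dplyr)

# Sketch only: collapse all tiles into one stencil so each GCM needs a single GDP query
tar_load(query_cells_centroids_list_by_tile)
all_cells_sf <- bind_rows(query_cells_centroids_list_by_tile)  # 363 cell centroids

one_big_stencil <- sf_pts_to_simplegeom(st_transform(all_cells_sf, 4326))
# ... then pass `one_big_stencil` as the `stencil` in one geoknife() call per GCM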

hcorson-dosch-usgs commented 2 years ago

Just a note - we're moving ahead on the netCDF PR #281 for now, so that I can work on incorporating those netCDFs into the lake-temperature-process-models pipeline, but we will potentially need to revise the netCDF code depending on the resolution of this issue (e.g., if the grid mapping has in fact changed, or if those cells shouldn't have NA values).

lindsayplatt commented 2 years ago

Verifying the projection. Michael Notaro sent us this rcm_map variable, which is the same as the one from the un-debiased data, which would mean that we are already using the appropriate projection. Validating further just to make sure by 1) plotting air temp for one lake in our current dataset against that of the same lake in the Winslow 2017 data release, and 2) using the un-projected coordinates to make a map of the grid bounding box.

rcm_map variable:

char rcm_map ;
    rcm_map:grid_mapping_name = "lambert_conformal_conic" ;
    rcm_map:standard_parallel = 36., 52. ;
    rcm_map:longitude_of_central_meridian = -97. ;
    rcm_map:latitude_of_projection_origin = 45. ;
    rcm_map:_CoordinateTransformType = "Projection" ;
    rcm_map:_CoordinateAxisTypes = "GeoX GeoY" ;
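
For reference, those grid-mapping attributes map directly onto the Lambert conformal conic PROJ string we already use; a minimal sketch of that translation (the +ellps=WGS84 and +units=m pieces are carried over from the existing pipeline string, not from rcm_map):

# lambert_conformal_conic -> +proj=lcc; standard_parallel -> +lat_1/+lat_2;
# longitude_of_central_meridian -> +lon_0; latitude_of_projection_origin -> +lat_0
rcm_map <- list(
  standard_parallel = c(36, 52),
  longitude_of_central_meridian = -97,
  latitude_of_projection_origin = 45
)
proj_str <- sprintf(
  "+proj=lcc +lat_0=%s +lon_0=%s +lat_1=%s +lat_2=%s +x_0=0 +y_0=0 +ellps=WGS84 +units=m",
  rcm_map$latitude_of_projection_origin,
  rcm_map$longitude_of_central_meridian,
  rcm_map$standard_parallel[1],
  rcm_map$standard_parallel[2]
)
proj_str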

1. Plotting air temp from two GCM sources for the same lake

Quantiles of difference between AirTemp (negative = Winslow was higher, positive = recent pull was higher). About 80% of the data is within +/- 3 deg

      10%       50%       90% 
-2.956229  0.199418  3.180503 

blue = 1:1 line, red = 10th and 90th percentiles

image

blue = 0 (no difference), red = 10th and 90th percentiles

image

2. Mapping xlat/xlon values from NetCDF file vs the projected coordinates

Using the lat/long:

image

Using the projected coordinates:

image

Code used to produce figures

Code for first two plots (you need to have already built the targets pipeline locally)

library(targets)
library(tidyverse)
library(sbtools)

local_dir <- "7_drivers_munge/test"
if(!dir.exists(local_dir)) dir.create(local_dir)
gcm_name <- "GFDL"

# Downloaded and unzipped gcm from this ScienceBase item locally:
# https://www.sciencebase.gov/catalog/item/57dc37b0e4b090824ffd5663
gcm_local <- file.path(local_dir, gcm_name)
gcm_local_files <- list.files(gsub(".zip", "", gcm_local))

# Define the lake
winslow_lake_to_eval <- "nhd_1099030"
# Got this xwalk file from Hayley
winslow_to_nhdhr_xwalk <- read_csv(file.path(local_dir, "winslow_nhdhr_xwalk.csv"))

# Figure out which nhdhr lake id is equivalent to the Winslow one and then identify the cell
nhdhr_id <- winslow_to_nhdhr_xwalk %>% filter(WINSLOW_ID == winslow_lake_to_eval) %>% pull(site_id)
targets::tar_load(lake_cell_xwalk_df)
cell_of_lake_to_eval <- lake_cell_xwalk_df %>% filter(site_id == nhdhr_id) %>% pull(cell_no)

old_gcm_airtemp <- read_csv(gcm_local_files[grepl(winslow_lake_to_eval, gcm_local_files)]) %>% 
  select(time, AirTemp) %>% rename(AirTemp_Winslow = AirTemp)

# Use the feather files that we created per cell & combine all three timeperiods
cell_feathers <- targets::tar_read(out_skipnc_feather)
cell_feathers_gcm <- cell_feathers[grepl(gcm_name, cell_feathers)]
cell_feathers_gcm_cell <- cell_feathers_gcm[grepl(sprintf("%s.feather", cell_of_lake_to_eval), cell_feathers_gcm)]
new_gcm_airtemp <- purrr::map(cell_feathers_gcm_cell, arrow::read_feather) %>% 
  bind_rows() %>% select(time, AirTemp) %>% rename(AirTemp_Recent = AirTemp)

all_gcm_airtemp <- full_join(old_gcm_airtemp, new_gcm_airtemp)

# Some weirdness with mismatched leap years in the new data still
all_gcm_airtemp %>% filter(is.na(AirTemp_Winslow) | is.na(AirTemp_Recent)) %>% View()

ggplot(all_gcm_airtemp, aes(AirTemp_Winslow, AirTemp_Recent)) + 
  geom_point(shape = 1) +
  geom_abline(slope=1, intercept = 0, color = "blue", size = 2) +
  geom_quantile(color = "red", quantiles = c(0.10, 0.90)) +
  ggtitle("AirTemp", subtitle = sprintf("Lake %s (Winslow: %s)", nhdhr_id, winslow_lake_to_eval))

all_gcm_airtemp_diff <- all_gcm_airtemp %>% 
  mutate(AirTemp_diff = AirTemp_Recent - AirTemp_Winslow)

# Look at quantile of differences. ~ 80% of data is within +/- 3 deg 
quantile(all_gcm_airtemp_diff$AirTemp_diff, probs = c(0.10, 0.50, 0.90), na.rm = TRUE)
#       10%       50%       90% 
# -2.956229  0.199418  3.180503 

ggplot(all_gcm_airtemp_diff, aes(time, AirTemp_diff)) + 
  geom_point(shape = 1) +
  geom_hline(yintercept = 0, color = "blue", size = 2) +
  geom_quantile(color = "red", quantiles = c(0.10, 0.90)) +
  ggtitle("AirTemp, difference between Winslow and recent", subtitle = sprintf("Lake %s (Winslow: %s)", nhdhr_id, winslow_lake_to_eval))

Code for the map (need to be on VPN for nc_open() to work)

library(sf)
library(ncdf4)

nc_proj_str <- "+proj=lcc +lat_0=45 + lon_0=-97 +lat_1=36 +lat_2=52 +x_0=0 +y_0=0 +ellps=WGS84 +units=m"
conus_sf <- st_as_sf(maps::map('state', plot=FALSE, fill=TRUE))

nc <- nc_open('http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias/files/debias_windspeed_mri_mid21_longerdaymet_ghcn_365d.nc') 
xmin <- min(ncvar_get(nc, 'xlon')) # For xmin
xmax <- max(ncvar_get(nc, 'xlon')) # For xmax
ymin <- min(ncvar_get(nc, 'xlat')) # For ymin
ymax <- max(ncvar_get(nc, 'xlat')) # For ymax

jxmin <- min(ncvar_get(nc, 'jx')) # For projected xmin
jxmax <- max(ncvar_get(nc, 'jx')) # For projected xmax
iymin <- min(ncvar_get(nc, 'iy')) # For projected ymin
iymax <- max(ncvar_get(nc, 'iy')) # For projected ymax

nc_close(nc)

latlon_extent <- expand.grid(x = c(xmin, xmax), y = c(ymin, ymax))
projected_extent <- expand.grid(x = c(jxmin, jxmax), y = c(iymin, iymax))

nc_bounds_sf <- st_as_sf(latlon_extent, coords = c("x", "y"), crs = 4326)
nc_bounds_proj_sf <- st_as_sf(projected_extent, coords = c("x", "y"), crs = nc_proj_str)

nc_bbox_sf <- st_as_sfc(st_bbox(nc_bounds_sf))
nc_bbox_proj <- st_as_sfc(st_bbox(nc_bounds_proj_sf))

plot(st_geometry(nc_bounds_sf), pch=20, col="red")
plot(st_geometry(nc_bbox_sf), add=TRUE)
plot(st_geometry(conus_sf), add=TRUE)

plot(st_geometry(nc_bounds_proj_sf), pch=20, col="red")
plot(st_geometry(nc_bbox_proj), add=TRUE)
plot(st_geometry(st_transform(conus_sf, nc_proj_str)), add=TRUE)
lindsayplatt commented 2 years ago

Started looking into the 4 missing cells because I was wondering whether or not they had anything to do with the weird projection stuff. I don't think so, but here is something weird. If I look at just one of the cells, 17909, and query the cell centroid (shown by the blue dot below), I get NaN values. If instead I query the centroid of the lakes (the open black circles) within this cell (shown by the red dot below), I don't get any NaNs. So, I'm not quite sure what is happening with GDP here. Perhaps the grids are actually smaller than we thought they were? I would expect both of these to return the same data if they are within the same cell. @jread-usgs any ideas?

Everything I dig into just opens more questions!!

image

library(sf)
library(geoknife)

query_gdp <- function(query_simplegeom) {
  gcm_job <- geoknife(
    stencil = query_simplegeom,
    fabric = webdata(
      url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_gfdl",
      variables = c("prcp_debias", "tas_debias", "rh_debias", "rsds_debias", "rsdl_debias", "windspeed_debias"),
      times = c("1981-01-01 00:00:00", "1981-02-01 00:00:00")
    )
    # The default knife algorithm is appropriate
  )
  wait(gcm_job)
  data_out <- result(gcm_job, with.units = TRUE)
  return(data_out)
}

red_data <- query_gdp(simplegeom(c(-94.87765, 47.79147)))
blue_data <- query_gdp(simplegeom(c(-94.81306, 47.81679)))
jordansread commented 2 years ago

wow, that is a surprising outcome...hmmm

We're using the feature-weighted summary algorithm of the GDP (that is the default, so we don't specify it). I know you are aware of that but stating in here to summarize the issue and my understanding of it. That algorithm will weight the areas of sections of the polygon that fall into different cells of the gridded data according to their percentage of area in the cells. We're using that algorithm but with a "point" we're creating a tiny little diamond polygon to use as the feature since GDP doesn't allow us to just use a true point (this is happening behind the scenes in geoknife, to buffer that point). Whenever the feature polygon is completely within the grid cell, we should get an identical result from GDP regardless of where in that grid cell the feature polygon is. So your red point and the blue point should return the same thing.
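
To illustrate that idea (this is not geoknife's actual internal code, just a toy picture of the buffered-point feature), a point turned into a tiny 4-vertex diamond that sits entirely inside one grid cell should be weighted 100% to that cell:

library(sf)

# Toy planar coordinates: a point buffered into a small "diamond" polygon,
# standing in for the feature geoknife submits to GDP in place of a true point
pt <- st_sfc(st_point(c(0, 0)))
diamond <- st_buffer(pt, dist = 1, nQuadSegs = 1)  # 1 segment per quadrant -> 4 vertices

# As long as this little polygon falls entirely within a single grid cell,
# the feature-weighted summary should return that cell's value unchanged
plot(diamond); plot(pt, add = TRUE, pch = 16)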

So...yes, I'm really surprised by this result.

A couple of things that could cause this:

1) Is there a chance your blue dot is actually on the edge of a cell instead of the true centroid? I think this could be the case if the grid construction for the queries didn't match the underlying data grid, and was offset by 1/2 a cell in any or both directions. Then the blue "centroid" would be a diamond that gets 1/2 or 1/4 of the weighted data from adjacent cells, and the red polygon would fall fully into one of those underlying data cells. Your email to Notaro should help this question too.

2) Now I'm thinking about the grid and spatial info...if we didn't know or have confidence what the spatial data were for this dataset when it came in (I think it was missing some spatial attributes we have for other hosted datasets), how does the GDP know what underlying data grid exists and how to mesh up these two sources of data (the points and the underlying grid) together to get the right result?? Any info GDP has for this assumption/calculation would be info we'd have access to as well, but are there different things going on?

for 1), I'm not seeing much evidence that the blue centroid is close to an edge:

library(sf)
library(geoknife)

query_gdp <- function(query_simplegeom) {
    gcm_job <- geoknife(
        stencil = query_simplegeom,
        fabric = webdata(
            url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_gfdl",
            variables = c("prcp_debias"),
            times = c("1981-01-01 00:00:00", "1981-01-02 00:00:00")
        )
        # The default knife algorithm is appropriate
    )
    wait(gcm_job)
    data_out <- result(gcm_job, with.units = TRUE)
    return(data_out)
}
query_gdp(simplegeom(data.frame(centroid = c(-94.81306, 47.81679), left = c(-94.8131, 47.81679), right = c(-94.81294, 47.81679), below = c(-94.81294, 47.81675), above = c(-94.81294, 47.81682))))

      DateTime above below centroid left right    variable statistic time(day of year)  units
1   1981-01-01   NaN   NaN      NaN  NaN   NaN prcp_debias      MEAN                 0 mm/day
2   1981-01-01   NaN   NaN      NaN  NaN   NaN prcp_debias      MEAN                 1 mm/day
3   1981-01-01   NaN   NaN      NaN  NaN   NaN prcp_debias      MEAN                 2 mm/day

So...What about setting up a line of points that inches between the blue and red dot. Is there a point where we get data?

n_guesses = 14
matrix(c(seq(-94.87765, to= -94.81306, length.out = n_guesses), seq(47.79147, to= 47.81679, length.out = n_guesses)), nrow = 2, byrow = TRUE) %>% as.data.frame() %>% 
     simplegeom() %>% query_gdp()
          DateTime          V1         V10         V11         V12         V13 V14          V2          V3          V4          V5          V6          V7          V8          V9    variable statistic time(day of year)  units
1  1981-01-01 0.006079078 0.006079078 0.006079078 0.006079078 0.006079078 NaN 0.006079078 0.006079078 0.006079078 0.006079078 0.006079078 0.006079078 0.006079078 0.006079078 prcp_debias      MEAN                 0 mm/day
2  1981-01-01 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 NaN 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 prcp_debias      MEAN                 1 mm/day
3  1981-01-01 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 NaN 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 prcp_debias      MEAN                 2 mm/day
4  1981-01-01 6.515005000 6.515005000 6.515005000 6.515005000 6.515005000 NaN 6.515005000 6.515005000 6.515005000 6.515005000 6.515005000 6.515005000 6.515005000 6.515005000 prcp_debias      MEAN                 3 mm/day

huh, only the blue dot (V14) is NaN in that case. So perhaps above when I was going left/right/up/down, I should've checked diagonally up/left. I'm thinking the blue dot may actually be at a corner after all??

jordansread commented 2 years ago

Update:

I'm not sure that my left, right, up, down nudges were actually wide enough to be outside of the geoknife buffer of the point. redo:

offset <- 0.01
query_gdp(simplegeom(data.frame(centroid = c(-94.81306, 47.81679), down = c(-94.81306, 47.81679-offset), up = c(-94.81306, 47.81679+offset), left = c(-94.81306-offset, 47.81679), right = c(-94.81306+offset, 47.81679),up_left = c(-94.81306-offset, 47.81679+offset), up_right = c(-94.81306+offset, 47.81679+offset), dwn_left = c(-94.81306-offset, 47.81679-offset), dwn_right = c(-94.81306+offset, 47.81679-offset))))
Process Accepted
     DateTime centroid        down    dwn_left   dwn_right left      right  up up_left   up_right    variable statistic time(day of year)  units
4  1981-01-01      NaN 6.515005000 6.515005000 7.633144000  NaN 7.18019700 NaN     NaN 7.18019700 prcp_debias      MEAN                 3 mm/day

I'm seeing 4 different values here, which I think means we're getting data from either 4 different underlying cells, or the buffered points are blending data slightly differently from ~3 different cells

A sketch of what this return of data looks like:

image

hcorson-dosch-usgs commented 2 years ago
  1. is there a chance your blue dot is actually on the edge of a cell instead of the true centroid? I think this could be the case if the grid construction for the queries didn't match the underlying data grid, and was offset by 1/2 a cell in any or both directions. Then the blue "centroid" would be a diamond that gets a 1/2 or 1/4 of the weighted data from adjacent cells, and the red polygon would fall fully into one of those underlying data cells. Your email to Notaro should help this question too
  2. Now I'm thinking about the grid and spatial info...if we didn't know or have confidence what the spatial data were for this dataset when it came in (I think it was missing some spatial attributes we have for other hosted datasets), how does the GDP know what underlying data grid exists and how to mesh up these two sources of data (the points and the underlying grid) together to get the right result?? Any info GDP has for this assumption/calculation would be info we'd have access to as well, but are there different things going on?

Lindsay and I landed on these same questions yesterday when we were chatting. I think it really comes down to that # 2.

I did some more checking of the grid information in the new netCDFs this morning, and while the extent is dramatically different than the old data (and could be incorrect -- hopefully Lindsay's email to Notaro clears this up), the good news is that it does appear that the spatial data the new netCDF contains is internally consistent (e.g., the extent of the grid constructed from the iy and jx values in that new netCDF matches the extent of the xlat and xlon values in that netCDF). And based on switching between WGS84 and the grid projection (as defined by the grid mapping variable that Notaro emailed, which matches the grid mapping variable in the previously used, biased netCDFs with the midwestern extent), it does seem like the spatial data is in the same crs.

Here blue dots = corners of grid constructed from iy and jx values. red dots = corners of xlat and xlon grid values. green dots and box = bounding points and box of xlat and xlon grid (wider than actual grid because grid curves in WGS84). grid = our reconstructed grid (from iy and jx values):

image

And if we use the native grid projection:

image

lindsayplatt commented 2 years ago

So based on that, we are confident that our grid reconstruction matches what the debiased (new) netCDF files have for both iy/jx and xlat/xlon. I think the remaining question is: if there is no rcm_map attribute in the netCDFs that are on GDP, what projection does GDP assume the grid is in? Do we need to have David add the rcm_map attribute that we received from Notaro to the data on THREDDS? Then, we could retry those same queries and see if there is a change. Either way, it seems like that rcm_map variable should end up with the original nc files.

jordansread commented 2 years ago

Here is what this looks like when calling GDP to ask for a subset of the underlying data based on the bounding box of the points we give it:

query_gdp <- function(query_simplegeom) {
    gcm_job <- geoknife(
        stencil = query_simplegeom,
        fabric = webdata(
            url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_mri",
            variables = c("prcp_debias"),
            times = c("1981-01-01 00:00:00", "1981-01-02 00:00:00"),
        ),
        knife = webprocess("subset", OUTPUT_TYPE="netcdf")

    )
    wait(gcm_job)
    return(gcm_job)
}
offset <- 0.01; 
geo_job <- query_gdp(simplegeom(data.frame(centroid = c(-94.81306, 47.81679), down = c(-94.81306, 47.81679-offset), up = c(-94.81306, 47.81679+offset), left = c(-94.81306-offset, 47.81679), right = c(-94.81306+offset, 47.81679),up_left = c(-94.81306-offset, 47.81679+offset), up_right = c(-94.81306+offset, 47.81679+offset), dwn_left = c(-94.81306-offset, 47.81679-offset), dwn_right = c(-94.81306+offset, 47.81679-offset))))
geoknife::download(geo_job, destination = '~/Downloads/test_subset_notarto.nc')
nc <- ncdf4::nc_open("~/Downloads/test_subset_notarto.nc")
ncvar_get(nc, "xlon")
          [,1]      [,2]      [,3]      [,4]
[1,] -95.32296 -95.31609 -95.30916 -95.30218
[2,] -94.98767 -94.97943 -94.97112 -94.96275
[3,] -94.65245 -94.64284 -94.63315 -94.62337
[4,] -94.31731 -94.30633 -94.29525 -94.28409

ncvar_get(nc, "xlat")
         [,1]     [,2]     [,3]     [,4]
[1,] 47.48399 47.71057 47.93709 48.16355
[2,] 47.47891 47.70547 47.93196 48.15841
[3,] 47.47290 47.69943 47.92591 48.15233
[4,] 47.46597 47.69247 47.91892 48.14531

~☝️ wow, the longitude cells are close together. 0.01°~ whoops, I was thinking about this incorrectly

the data shows NaNs appear in some cells but there isn't a hard edge in x or y. Not sure why these are NaN

ncvar_get(nc, "prcp_debias", start = c(1,1,1,1), count = c(-1,-1,1,-1))
     [,1]        [,2] [,3] [,4]
[1,]    0 0.000000000    0    0
[2,]    0 0.006079078  NaN  NaN
[3,]    0 0.005200386    0  NaN
[4,]    0 0.000000000    0    0

looks like this could be because they are on the edge of a lake?

image

ncvar_get(nc, "xlat")[2,4]
[1] 48.15841
> ncvar_get(nc, "prcp_debias", start = c(1,1,1,1), count = c(-1,-1,1,-1))[2,4]
[1] NaN
> ncvar_get(nc, "xlon")[2,4]
[1] -94.96275
jordansread commented 2 years ago

Ok, I think I'm feeling more confident in what is going on here.

From the netcdf subset, I plotted the values for Longitude on Cross Points and Latitude on Cross Points. I wasn't sure if those were centroids of the cells or vertices/corners of the cell, but I think our math assumes they are vertices of the cell. I plotted those values as red when they had data and grey when they were NA (as "+" markers).

Then I plotted your "blue point" (cyan here), which falls in the centroid assuming those other points are corners. But then when adding the upper left, up, right, etc from above (grey when those were NA, blue when they had data) it seems likely that our assumed centroid is the lower right corner of the NA data cell (I drew a box below):

image

So, I'd guess either we've made a mistake assuming the meaning of the cross points (they are centroids instead of corners) or GDP has made a mistake on their meaning (they are corners, but being treated as centroids for data processing).

jordansread commented 2 years ago

perhaps offset needs to include a 0.5 width and 0.5 height?
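
For scale (a back-of-envelope, not a fix): half of a 25 km cell is roughly 0.11° of latitude and 0.17° of longitude at this latitude, which is much larger than the 0.01° nudges used above:

# How many degrees is half of a 25 km cell near 47.8 N?
half_cell_m <- 25000 / 2
lat <- 47.8

half_cell_m / 111320                           # ~0.11 degrees of latitude
half_cell_m / (111320 * cos(lat * pi / 180))   # ~0.17 degrees of longitude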

hcorson-dosch-usgs commented 2 years ago

@jread-usgs

What I'm seeing in the underlying data (cells of 0.01°E or 0.3N°) doesn't seem to line up with the cell query map

What do you mean here by 'cells of 0.01°E or 0.3N°' ? That the cells are irregular, or that they aren't square?

jordansread commented 2 years ago

I deleted that comment (I think right after I posted it) because I was reading the lat/lon data incorrectly. Is that comment still up here somewhere?

lindsayplatt commented 2 years ago

I still see it with a map of the grid cells filled by number of lakes

jordansread commented 2 years ago

Ha! So it was just deleted from my view of this issue!? 😆 Good to know. Anyhow, ignore that comment and the one I crossed out above (that was the same incorrect read)

~☝️ wow, the longitude cells are close together. 0.01° whoops~, I was thinking about this incorrectly

lindsayplatt commented 2 years ago

I wanted to see where the NAs were in the raw NC files (per the recent Notaro email), but I can't seem to extract any actual values from these nc files. I keep getting an error. When I look up the error, all the resolutions seem to point to a corrupt file. I can get time and xlon/xlat in this same way. I've also double-checked that the variable names I am using match those in each nc file using names(nc$var). Am I doing something wrong?

library(ncdf4)

# Need to be on VPN
lapply(c( "prcp", "tas", "rh", "rsds", "rsdl", "windspeed"), function(var) {
  var_url <- sprintf('http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias/files/debias_%s_mri_mid21_longerdaymet_ghcn_365d.nc', var)
  nc <- nc_open(var_url) 
  values_df <- ncvar_get(nc, sprintf("%s_debias", var), start = c(0,0,0,0), count = c(1,1,1,1))
  nc_close(nc)
  return(values_df)
})

Error in ncvar_get_inner(ncid2use, varid2use, nc$var[[li]]$missval, addOffset,  : 
  C function R_nc4_get_vara_double returned error 

I get the same error when I don't specify start or count.

jordansread commented 2 years ago

start would need to be c(1,1,1,1) for this, since ncdf4 uses 1-based (R) indexing instead of 0-based

jordansread commented 2 years ago

And I think you'd need something like this @lindsayplatt

var <- "prcp"
var_url <- sprintf('http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias/files/debias_%s_mri_mid21_longerdaymet_ghcn_365d.nc', var)
nc <- nc_open(var_url) 
values_df <- ncvar_get(nc, sprintf("%s_debias", var), start = c(1,1,1,1), count = c(-1,-1,1,1))
nc_close(nc)

(see the -1,-1 on the count for the lat/lon slots, allowing you to get all space, but only one year and one day. -1 means "all")
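
Continuing from that snippet, one quick way to see where the NaNs fall in that single-day slice (a sketch; the first two dimensions of the returned matrix are the spatial ones, indexing into jx and iy):

# Index positions of NaNs in the one-day slice pulled above
na_idx <- which(is.na(values_df), arr.ind = TRUE)
head(na_idx)   # row = jx index, col = iy index
nrow(na_idx)   # how many grid cells are NaN on that day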

lindsayplatt commented 2 years ago

Ah ok. That still doesn't resolve the error: as I noted at the end of that comment, I get the error even when I don't specify start or count. Running your code returns an even scarier failure. Does that code work for you? I can get lat and lon just fine:

lon <- ncdf4::ncvar_get(nc, 'xlon')
lat <- ncdf4::ncvar_get(nc, 'xlat')

Running your code prints this scary HTML output:

syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <!doctype^ html><html lang="en"><head><title>HTTP Status 400 – Bad Request</title><style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 400 – Bad Request</h1><hr class="line" /><p><b>Type</b> Exception Report</p><p><b>Message</b> Invalid character found in the request target [&#47;thredds&#47;dodsC&#47;notaro_debias&#47;files&#47;debias_prcp_mri_mid21_longerdaymet_ghcn_365d.nc.dods?prcp%5fdebias.prcp%5fdebias[0][0][0:140][0:216] ]. The valid characters are defined in RFC 7230 and RFC 3986</p><p><b>Description</b> The server cannot or will not process the request due to something that is perceived to be a client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing).</p><p><b>Exception</b></p><pre>java.lang.IllegalArgumentException: Invalid character found in the request target [&#47;thredds&#47;dodsC&#47;notaro_debias&#47;files&#47;debias_prcp_mri_mid21_longerdaymet_ghcn_365d.nc.dods?prcp%5fdebias.prcp%5fdebias[0][0][0:140][0:216] ]. The valid characters are defined in RFC 7230 and RFC 3986 org.apache.coyote.http11.Http11InputBuffer.parseRequestLine(Http11InputBuffer.java:509) org.apache.coyote.http11.Http11ProcError in Rsx_nc4_get_vara_double: NetCDF: Access failure
essor.service(Http11Processor.java:511) org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65) org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:831) org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1673) org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) org.apache.tomcat.util.threads.ThreadPoolExecutorVar: prcp_debias  Ndims: 4   Start: $Worker.run(ThreadPoolExecutor.java:659) org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) java.lang.Thread.run(Thread.java:748)</pre><p><b>Note</b> The full stack trace of the root cause is available in the server logs.</p><hr class="line" /><h3>Apache Tomcat</h3></body></html>
0,0,0,0 Count: 1,1,141,217
Error in ncvar_get_inner(ncid2use, varid2use, nc$var[[li]]$missval, addOffset,  : 
  C function R_nc4_get_vara_double returned error
jordansread commented 2 years ago

huh, yeah, it seemed to work for me:

image

I haven't used the direct URL method to get more complicated data like this before. Lat/lon are simpler because they don't have other dimensions and probably don't have offsets/compression. Maybe the other variables are causing issues due to that?

lindsayplatt commented 2 years ago

Updating ncdf4 from 1.18 to 1.19 to see if that fixes it

lindsayplatt commented 2 years ago

That did not fix it. Le sigh, so much for a quick check of the raw nc file

lindsayplatt commented 2 years ago

I adjusted the grid so that we assume the points in the NetCDF are cell centroids, and then tried querying for those missing cells again (plus some nearby cells). The centroid of the cell and the centroid of the lakes now return the same result (so I can assume this means we have appropriately reconstructed the grid locally!):
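
For the record, the adjustment itself boils down to a half-cell shift of the grid origin (a minimal sketch using the grid parameters from earlier in this thread, not the exact pipeline code):

# If jx/iy are cell centers rather than lower-left corners, the grid origin
# moves back by half a cell in each direction
cellsize <- 25000
jx_min <- -2700000   # min(jx) from the netCDF
iy_min <- -1750000   # min(iy) from the netCDF

origin_before_fix <- c(xmin = jx_min, ymin = iy_min)                               # corners assumed
origin_after_fix  <- c(xmin = jx_min - cellsize / 2, ymin = iy_min - cellsize / 2) # centers assumed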

library(targets)
library(sf)
library(tidyverse)

na_cells <- c(17909, 18126, 18127, 18993)
nearby_cells <- c(18128, 18343, 18344)
all_cells <- c(na_cells, nearby_cells)

query_lakeids <- tar_read(lake_cell_xwalk_df) %>% filter(cell_no %in% all_cells)

cell_geom <- tar_read(grid_cells_sf) %>% 
  filter(cell_no %in% all_cells) %>% 
  st_transform(4326) 

cell_centroid <- tar_read(grid_cell_centroids_sf) %>% 
  filter(cell_no %in% all_cells) %>% 
  st_transform(4326) 

# Get centroid for each of these actual lakes
lake_centroids <- readRDS("2_crosswalk_munge/out/centroid_lakes_sf.rds") %>% 
  filter(site_id %in% query_lakeids$site_id) %>%
  left_join(select(query_lakeids, site_id, cell_no)) %>% 
  st_zm() %>% 
  st_centroid()

# Find centroid of the lake centroids per cell
centroid_of_cell_lakes <- lake_centroids %>% 
  bind_cols(as.data.frame(st_coordinates(lake_centroids))) %>% 
  st_drop_geometry() %>% 
  group_by(cell_no) %>% 
  summarize(X = mean(X), Y = mean(Y)) %>% 
  st_as_sf(coords = c("X", "Y"), crs = 4326)

# Construct the grid before the fix to show on the plot
original_cell_geom <- (tar_read(grid_cells_sf) %>% 
  filter(cell_no %in% all_cells) + 25000/2) %>%
  st_as_sf(crs = "+proj=lcc +lat_0=45 + lon_0=-97 +lat_1=36 +lat_2=52 +x_0=0 +y_0=0 +ellps=WGS84 +units=m") %>% 
  st_transform(4326)

plot(st_geometry(cell_geom))
plot(st_geometry(lake_centroids), add=T)
plot(st_geometry(centroid_of_cell_lakes), add=T, col="red", pch=16, cex=2)
plot(st_geometry(cell_centroid), add=T, col="blue", pch=16, cex=2)
plot(st_geometry(original_cell_geom), add=TRUE, border = "grey", lwd=2)

image

Here is what I did to test the GDP data of these cells

library(geoknife)

query_gdp <- function(query_simplegeom) {
  gcm_job <- geoknife(
    stencil = query_simplegeom,
    fabric = webdata(
      url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_ipsl",
      variables = c("prcp_debias"),# "tas_debias", "rh_debias", "rsds_debias", "rsdl_debias", "windspeed_debias"),
      times = c("1981-01-01 00:00:00", "1981-02-01 00:00:00")
    )
    # The default knife algorithm is appropriate
  )
  wait(gcm_job)
  data_out <- result(gcm_job, with.units = TRUE)
  return(data_out)
}

sf_pts_to_simplegeom <- function(sf_obj, id_col = "cell_no") {
  sf_obj %>%
    st_coordinates() %>%
    t() %>% as.data.frame() %>%
    # Include the cell numbers in the GDP geom so that the
    # results can be linked to the correct cells
    setNames(sf_obj[[id_col]]) %>%
    simplegeom()
}

cell_centroids_data <- query_gdp(sf_pts_to_simplegeom(cell_centroid)) %>% 
  mutate(date = as.Date(DateTime) + `time(day of year)`) %>% 
  select(date, all_of(as.character(seq_along(cell_centroid$cell_no)))) 
names(cell_centroids_data) <- c("date", cell_centroid$cell_no)

centroid_of_cell_lakes_data <- query_gdp(sf_pts_to_simplegeom(centroid_of_cell_lakes)) %>% 
  mutate(date = as.Date(DateTime) + `time(day of year)`) %>% 
  select(date, all_of(as.character(seq_along(centroid_of_cell_lakes$cell_no)))) 
names(centroid_of_cell_lakes_data) <- c("date", centroid_of_cell_lakes$cell_no)

cell_centroids_data_summary <- cell_centroids_data %>% 
  pivot_longer(-date, names_to = "cell_no") %>% 
  mutate(cell_no = as.numeric(cell_no)) %>% 
  group_by(cell_no) %>% 
  summarize(n = n(), 
            n_na = sum(is.na(value)),
            percent_na = n_na/n)
cell_centroids_data_summary_sf <- cell_centroid %>% left_join(cell_centroids_data_summary)

centroid_of_cell_lakes_data_summary <- centroid_of_cell_lakes_data %>% 
  pivot_longer(-date, names_to = "cell_no") %>% 
  mutate(cell_no = as.numeric(cell_no)) %>% 
  group_by(cell_no) %>% 
  summarize(n = n(), 
            n_na = sum(is.na(value)),
            percent_na = n_na/n)
centroid_of_cell_lakes_data_summary_sf <- centroid_of_cell_lakes %>% left_join(centroid_of_cell_lakes_data_summary)

ggplot() + 
  geom_sf(data = cell_centroids_data_summary_sf, aes(color = percent_na), size=2) +
  geom_sf(data = centroid_of_cell_lakes_data_summary_sf, aes(color = percent_na), size=2) +  
  scico::scale_color_scico(palette = "batlow", direction = -1) +
  geom_sf(data = cell_geom, fill=NA, size = 1) +
  # geom_sf_label(data = cell_geom, aes(label = cell_no)) +
  theme_void()

# Values for centroid of cell & centroid of lakes within the cell give the same exact results!
# Choosing a cell that has data and one that doesn't have any

> centroid_of_cell_lakes_data %>% select(`17909`) %>% summary()
     17909        
 Min.   : 0.0000  
 1st Qu.: 0.0000  
 Median : 0.4505  
 Mean   : 2.4753  
 3rd Qu.: 1.4072  
 Max.   :78.8974  
> cell_centroids_data %>% select(`17909`) %>% summary()
     17909        
 Min.   : 0.0000  
 1st Qu.: 0.0000  
 Median : 0.4505  
 Mean   : 2.4753  
 3rd Qu.: 1.4072  
 Max.   :78.8974

> centroid_of_cell_lakes_data %>% select(`18344`) %>% summary()
     18344    
 Min.   : NA  
 1st Qu.: NA  
 Median : NA  
 Mean   :NaN  
 3rd Qu.: NA  
 Max.   : NA  
 NA's   :365  
> cell_centroids_data %>% select(`18344`) %>% summary()
     18344    
 Min.   : NA  
 1st Qu.: NA  
 Median : NA  
 Mean   :NaN  
 3rd Qu.: NA  
 Max.   : NA  
 NA's   :365  

Plot of percent NAs returned by either the centroid of the cell or centroid of the lake.

image

lindsayplatt commented 2 years ago

A look at NA values for grid cells before and after the grid-shift fix. The fix seems to have added more missing data than before?

image

image

# Function to download one month of data for all MN cells
download_and_plot_data <- function(mn_centroids_sf, mn_cells_sf, title) {

  query_gdp <- function(query_simplegeom) {
    gcm_job <- geoknife(
      stencil = query_simplegeom,
      fabric = webdata(
        url = "http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_ipsl",
        variables = c("prcp_debias", "tas_debias", "rh_debias", "rsds_debias", "rsdl_debias", "windspeed_debias"),
        times = c("1981-01-01 00:00:00", "1981-02-01 00:00:00")
      )
      # The default knife algorithm is appropriate
    )
    wait(gcm_job)
    data_out <- result(gcm_job, with.units = TRUE)
    return(data_out)
  }
  # Call this inside to overwrite the function loaded by the `source()` call for `get_query_cells()`
  sf_pts_to_simplegeom <- function(sf_obj, id_col = "cell_no") {
    sf_obj %>%
      st_coordinates() %>%
      t() %>% as.data.frame() %>%
      # Include the cell numbers in the GDP geom so that the
      # results can be linked to the correct cells
      setNames(sf_obj[[id_col]]) %>%
      simplegeom()
  }

  all_MN_cell_data <- query_gdp(sf_pts_to_simplegeom(mn_centroids_sf, id_col = "site_id"))
  all_MN_cell_data_formatted <- all_MN_cell_data %>% 
    select(DateTime, all_of(as.character(seq_along(mn_centroids_sf$cell_no)))) 
  names(all_MN_cell_data_formatted) <- c("DateTime", mn_centroids_sf$cell_no)

  # Summarize the month by counting how many NAs
  all_MN_cell_data_summary <- all_MN_cell_data_formatted %>% 
    pivot_longer(-DateTime, names_to = "cell_no") %>% 
    group_by(cell_no) %>% 
    summarize(na_count = sum(is.na(value)), 
              percent_na = na_count/n()) %>% 
    mutate(cell_no = as.numeric(cell_no)) 
  all_MN_cell_data_summary_sf <- mn_cells_sf %>% 
    left_join(all_MN_cell_data_summary, by = "cell_no")

  mn_sf <- st_as_sf(maps::map('state', 'minnesota', fill=TRUE, plot=FALSE))
  print(ggplot() + 
          geom_sf(data = all_MN_cell_data_summary_sf, aes(fill = percent_na)) +  
          scico::scale_fill_scico(palette = "batlow", direction = -1) +
          geom_sf(data = mn_sf, fill=NA, size = 1) +
          theme_void()+
          ggtitle(title)) 
}

# Construct the grid before the fix to show on the plot

source("7_drivers_munge/src/GCM_driver_utils.R") # Load `get_query_cells()`
original_cell_geom <- (tar_read(grid_cells_sf) + 25000/2) %>%
  st_as_sf(crs = "+proj=lcc +lat_0=45 + lon_0=-97 +lat_1=36 +lat_2=52 +x_0=0 +y_0=0 +ellps=WGS84 +units=m")
all_MN_cell_nos_orig <- get_query_cells(original_cell_geom, tar_read(query_lake_centroids_sf)) %>% pull(cell_no)
all_MN_cells_orig <- original_cell_geom %>% filter(cell_no %in% all_MN_cell_nos_orig) %>% st_transform(4326)
all_MN_cell_centroids_orig <- st_centroid(all_MN_cells_orig)

download_and_plot_data(all_MN_cell_centroids_orig, all_MN_cells_orig, "Before fix")

# Construct the grid with the current fix

all_MN_cells <- tar_read(grid_cells_sf) %>% filter(cell_no %in% tar_read(query_cells)) %>% st_transform(4326)
all_MN_cell_centroids <- st_centroid(all_MN_cells)

download_and_plot_data(all_MN_cell_centroids, all_MN_cells, "After fix")
lindsayplatt commented 2 years ago

The difference in missing cells does make sense given where the lakes are and how the new grid lines up:

image

image

mn_sf <- st_as_sf(maps::map('state', 'minnesota', fill=TRUE, plot=FALSE))
lake_centroids <- readRDS("2_crosswalk_munge/out/centroid_lakes_sf.rds") %>% 
  st_intersection(mn_sf)

# BEFORE THE FIX

plot(st_geometry(mn_sf), main = "Before fix")
plot(st_geometry(lake_centroids), add=T, col="grey")
plot(st_geometry(all_MN_cells_orig), add=T)

# AFTER THE FIX

plot(st_geometry(mn_sf), main="After fix")
plot(st_geometry(lake_centroids), add=T, col="grey")
plot(st_geometry(all_MN_cells), add=T)
hcorson-dosch-usgs commented 2 years ago

Also matches with what the raw gridded data show - this is from the subset Jordan pulled with

nccopy http://gdp-netcdfdev.cr.usgs.gov:8080/thredds/dodsC/notaro_debias_mri?time[0:1:1],year[0:1:1],jx[0:1:216],iy[0:1:140],prcp_debias[0:1:1][0:1:1][0:1:140][0:1:216] TOSS_NETCDF_NOTARO.nc

image

lindsayplatt commented 2 years ago

I am going to pick up here with dates next week (this happens in the munge step, so not difficult to do post-download).

# What is probably happening is that the GCMs don't acknowledge 
# leap years but when we turn `DateTime` & `time(day of year)` 
# into a date, it thinks there are leap days, which is why each
# year after a leap year is one day short.

munged_data <- tar_read(glm_ready_gcm_data_feather)[1] %>% 
  arrow::read_feather() %>%
  # Use only one cell to investigate date issues
  filter(cell == 13782) %>% 
  mutate(is_leapyear = lubridate::leap_year(time),
         year = as.numeric(format(time, "%Y")))

munged_data %>% 
  mutate(mon_day = format(time, "%m-%d")) %>% 
  filter(mon_day == "02-29")

munged_data %>% 
  group_by(year) %>% 
  summarize(n = n(),
            is_leap = unique(is_leapyear))

# A tibble: 60 x 3
year     n is_leap
<dbl> <int> <lgl>  
1  1981   365 FALSE  
2  1982   365 FALSE  
3  1983   365 FALSE  
4  1984   366 TRUE   
5  1985   364 FALSE  
6  1986   365 FALSE  
7  1987   365 FALSE  
8  1988   366 TRUE   
9  1989   364 FALSE  
10  1990   365 FALSE  
# ... with 50 more rows
lindsayplatt commented 2 years ago

The leap year issue is more complicated than I thought. For some reason, the DateTime column has bizarre times, and 1985 has a start date of 1984-12-31 23:15:03. Not sure why these times are here? I had been assuming times were all 00:00:00. I was going to use the values of DateTime to figure out if it was a leap year or not, but it feels a bit weird to do that when the start of 1985 is actually 1984-12-31 23:15:03.

image

Here are all the unique DateTime values that we add time(day of year) to in order to get the actual date.

image

hcorson-dosch-usgs commented 2 years ago

Hmm, that is odd. That DateTime field is returned by GDP and then saved in the raw feather files? I'm just wondering where that comes from, since the raw netCDF files only have a time field (day of year) and a year field:

image

lindsayplatt commented 2 years ago

Yes, the DateTime column is what is returned from GDP.

lindsayplatt commented 2 years ago

I have a solution but wowza ... it was pretty complicated. I end up with 365 records per year, which is exactly what we want
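
For reference, a minimal sketch of one way such a fix could work (not necessarily what ended up in the pipeline), assuming a 0-based `time(day of year)` on a 365-day noleap calendar plus a year recovered from the GDP DateTime:

library(lubridate)

# Map a 0-based day-of-year from a 365-day (noleap) calendar onto real calendar
# dates, skipping Feb 29 in leap years so every year keeps exactly 365 records
noleap_to_date <- function(year, doy_0based) {
  d <- as.Date(sprintf("%d-01-01", year)) + doy_0based
  # In leap years, noleap day 59 (0-based) should be Mar 1, not Feb 29,
  # so push that day and everything after it forward by one calendar day
  shift <- leap_year(year) & doy_0based >= 59
  d + as.integer(shift)
}

noleap_to_date(1984, 58:60)   # "1984-02-28" "1984-03-01" "1984-03-02"
noleap_to_date(1984, 364)     # "1984-12-31"
noleap_to_date(1985, 364)     # "1985-12-31"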

jordansread commented 2 years ago

Just wanted to weigh in on the "DateTime comes from GDP"

image

This is the raw result from GDP. Perhaps something is funky with result if you are getting those bad dates?

lindsayplatt commented 2 years ago

That's what I get for the first year, too. What does your date next to 1984 look like?

jordansread commented 2 years ago

Yuck!

image

lindsayplatt commented 2 years ago

Yeah :(

lindsayplatt commented 2 years ago

Redid the plots from https://github.com/USGS-R/lake-temperature-model-prep/issues/273#issuecomment-1027320572 with the new grid + my fix for leap year dates. However, the Winslow data and these new data still treated leap years differently, so there was a mismatch, but ONLY for leap years. I did one set of plots with all the dates included and then one where I removed any data from leap years. Feeling confident about the grid we've got going on now.

Leap years still in data, but fixed to be better handled in new GCMs + new grid

       10%        50%        90% 
-0.2699306  0.0000000  0.0000000 

image

image

Leap years removed entirely from both datasets + new grid

10% 50% 90% 
  0   0   0 

image

image

lindsayplatt commented 2 years ago

Hmmm although looking at the mismatch between my fixed data and the Winslow data shows that my fix isn't quite working for all the years. I now have Dec 31 for all leap years in our dataset, BUT we shouldn't have NAs for 03-01, only for 02-29. A bit more tweaking needed ...

image