DOI-USGS / lake-temperature-model-prep


Code to switch to new GCMs & scale up to MN #276

Closed lindsayplatt closed 2 years ago

lindsayplatt commented 2 years ago

Scaling up to MN will require one additional step, which is to have a more robust way to subset our lakes to MN than what I did here. I was able to test this with those 5 specific NML sites (see this commit) and it worked (see the notes here). Now I am running this for our new

This addresses #273 and #258. Leaving as draft until my tar_make(glm_ready_gcm_data_feather) command finishes.

Note that this pull request includes the results of a git pull origin main, which is why it shows so many commits and changes. Since we kept the GCM pipeline in a separate branch, we did this step to bring the updates from main into this branch. This commit is really the first commit of new changes related to the GCM work.

jordansread commented 2 years ago

Does the new grid correspond to the extent of this image (from here), where the old version was a subset of it? Or is the new grid even bigger?

lindsayplatt commented 2 years ago

I'm seeing some queries fail in a weird way. When I ran it myself, the error came up during the wait() call, but when you then check, the statusType is succeeded. @jread-usgs, do you recognize this error? I can see the "Process Accepted" message, and then it fails when it gets to the wait() call, which causes my whole target to fail.

> wait(gcm_job)
[=====================================================================================================] 100%
Error in pb_update(self, private, ratio, tokens) : 
  !self$finished is not TRUE
> gcm_job_status <- check(gcm_job)$statusType
> check(gcm_job)$statusType
[1] "ProcessSucceeded"
> my_data <- result(gcm_job, with.units = TRUE) # This works to load! It is 131,400 rows with 74 columns
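Until the upstream fix lands, one possible guard (a sketch, not code from this pipeline; `wait_safely()` is a made-up name, while `wait()` and `check()` are the geoknife functions used above) is to treat the progress-bar error as non-fatal and poll the job status directly:

```r
# Hypothetical workaround: swallow the progress-bar error and fall back to
# polling check() until GDP reports a terminal status.
wait_safely <- function(job, sleep = 45) {
  tryCatch(
    wait(job),
    error = function(e) {
      # geoknife#366: pb_update() can error even though the job succeeded,
      # so keep polling until the job reaches a terminal state
      while (!check(job)$statusType %in% c("ProcessSucceeded", "ProcessFailed")) {
        Sys.sleep(sleep)
      }
    }
  )
  check(job)$statusType
}
```

A target could then call wait_safely(gcm_job) in place of wait(gcm_job) and only treat "ProcessFailed" as a real failure.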
lindsayplatt commented 2 years ago

^ @jread-usgs do you have any thoughts about the error I am seeing?

jordansread commented 2 years ago

I think it is related to this: https://github.com/USGS-R/geoknife/issues/366

But it shouldn't have any impact on the results; as far as I understand it, this is just an error in checking how far along the job is. It looks like you still got data, which wouldn't happen if the job had failed.

lindsayplatt commented 2 years ago

Ah, so for now I just need to update the package so it doesn't throw that error 👍

jordansread commented 2 years ago

I'm assuming you already have that update in your pkg, but perhaps you are seeing another instance of that progress bar issue?

lindsayplatt commented 2 years ago

Oh, you are right. I did have the update. I ended up adding gconfig('sleep.time' = 45, 'wait' = TRUE) before my call, and that got around the error. Should I use that, or is it too temporary a solution?

jordansread commented 2 years ago

Yeah, by changing your sleep to 45 secs you ping GDP fewer times and thereby reduce the chance of hitting this error. I assumed you were already using wait, just calling it within the geoknife function instead of setting it via the config.

lindsayplatt commented 2 years ago

I was running this:

gcm_job <- geoknife(
  stencil = query_simplegeom,
  fabric = webdata(
    url = query_url,
    variables = c(
      'windspeed_debias', 
      'tas_debias', # air temp
      'rsds_debias', #shortwave radiation
      'rsdl_debias', # longwave radiation
      'rh_debias', # relative humidity
      'prcp_debias'), # precip
    times = c('1980-01-01', '2100-01-01')
  )
  # The default knife algorithm is appropriate
)
wait(gcm_job)
gcm_job_status <- check(gcm_job)$statusType

So I should be doing the wait() right after? Sorry, I'm confused about what the solution to getting past this is.

jordansread commented 2 years ago

Basically geoknife(..., wait = TRUE); here is a simple example:

job <- geoknife(stencil = c(-89,42), fabric = 'prism', wait = TRUE)

But setting the config value modifies whether wait is TRUE or FALSE by default.

lindsayplatt commented 2 years ago

Hmm, I added wait = TRUE as an argument but still get the error. I think I could benefit from a short call to understand what is going on and what needs to be done to get past it. I'll ping you on Teams.

lindsayplatt commented 2 years ago

Still waiting on my download to finish all the branches, but it has been marching along just fine so far: 30 out of 54 are complete. The only piece left is to verify the grid. I have added that as an item to #273, so that we can do it before calling our switch to the new GCMs complete. To get some of these grid and query (MN) changes to @hcorson-dosch for continued work on the NetCDF, we are going to merge this PR.

lindsayplatt commented 2 years ago

OK, the download finished!!! 🎉🎉🎉 Reminder that this download was for 9 tiles covering all of the MN lakes that we have in this pipeline so far. Each query was for all 3 of the projection time periods at once (21,900 days per query) and there were 6 GCMs (that's 54 queries in total!). Before sending to Hayley, I spent some time exploring. There are a couple of flags that I see:

  1. The start/end times of the new debiased GCMs aren't the same as the old ones. Weirdly, the first chunk is shifted by one year, and all of them end on 12-30 rather than 12-31.
  2. There are 4 cells that account for all of the NAs. Need to explore why these wouldn't have data.

Map of the query

# These libraries are needed for the rest of the code chunks
library(targets)
library(sf)
library(tidyverse)

tar_load(lake_cell_xwalk_df)
tar_load(grid_cells_sf)
tar_load(grid_tiles_sf)
tar_load(query_cells)
tar_load(query_tiles)

lakes_per_cell <- lake_cell_xwalk_df %>% group_by(cell_no) %>% summarize(nlakes = n())
mn_grid_tiles_sf <- grid_tiles_sf %>% filter(tile_no %in% query_tiles)
mn_grid_cells <- grid_cells_sf %>% filter(cell_no %in% query_cells) %>% left_join(lakes_per_cell)

ggplot() + 
  geom_sf(data = mn_grid_cells, aes(fill = nlakes)) +
  scico::scale_fill_scico(palette = "batlow", direction = -1) +
  geom_sf(data = mn_grid_tiles_sf, fill = NA, size = 2)

image

Download diagnostics

Using the same function from here, I found the following diagnostics for run time and file size. In total, it took 16 hours to download all the GCM data for all of MN. That seems OK, but not amazing: not ideal as we scale, but I think we can move this part to HPC to make it less painful. It was painful this time around only because I had issues with the VPN connection timing out overnight and with keeping my computer from going to sleep. Running on HPC would fix both of those things.


# How long did the GDP download take and how big are the resulting files?
tar_load(gcm_data_raw_feather)
target_metadata_download <- summarize_target_build(names(gcm_data_raw_feather))

-- DURATION --
Shortest: 4.021 min
Longest: 44.123 min
Total: 988.789 min

-- FILE SIZES --
Smallest: 4.377 mb
Biggest: 95.938 mb
Total: 2116.269 mb
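For context, the per-query average implied by those totals (plain arithmetic on the numbers above):

```r
# 54 queries and ~988.8 download-minutes total works out to roughly
# 18 minutes per query, consistent with the ~16 h wall clock
# (16 * 60 = 960 min) for a mostly sequential download.
total_min <- 988.789
n_queries <- 54
round(total_min / n_queries, 1)  # ~18.3 min per query
```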

Data diagnostics

# Load and combine ALL downloaded GCMs (6 GCMs, 9 tiles w/ varying number of cells)
tar_load(glm_ready_gcm_data_feather)
munged_gcm_drivers <- purrr::map(glm_ready_gcm_data_feather, arrow::read_feather) %>% bind_rows()

Checking the data ranges (as described in item 1 above)

# Range of each chunk:
munged_gcm_drivers %>% filter(time < as.Date('2030-01-01')) %>% pull(time) %>% range()
munged_gcm_drivers %>% filter(time > as.Date('2030-01-01'), 
                              time < as.Date('2070-01-01')) %>% pull(time) %>% range()
munged_gcm_drivers %>% filter(time > as.Date('2070-01-01')) %>% pull(time) %>% range()

"1981-01-01" "2000-12-30"
"2040-01-01" "2059-12-30"
"2080-01-01" "2099-12-30"

Understanding the NA values (as described in item 2 above)

munged_gcm_nas <- munged_gcm_drivers %>% 
  # Flag NAs in the driver-variable columns (the ones with capitalized
  # names), leaving `time` and `cell` alone
  mutate(across(.cols = matches("[A-Z]{1}[a-z]+", ignore.case = F), .fns = is.na)) %>% 
  rowid_to_column() %>% 
  filter_at(.vars = vars(-rowid, -time, -cell), .vars_predicate = any_vars(.))

# 4 cells account for all NAs when looking at data from all 6 GCMs
# 17909, 18126, 18127, 18993
na_cells <- munged_gcm_nas %>% 
  distinct(cell) %>% 
  pull(cell)

# Where are these cells?
tar_load(grid_cells_sf)
tar_load(grid_tiles_sf)
tar_load(query_cells)
tar_load(query_tiles)

mn_grid_tiles_sf <- grid_tiles_sf %>% filter(tile_no %in% query_tiles)
mn_grid_cells_sf <- grid_cells_sf %>% filter(cell_no %in% query_cells)
grid_cells_na <- grid_cells_sf %>% filter(cell_no %in% na_cells)

ggplot() +
  geom_sf(data=mn_grid_cells_sf) + 
  geom_sf(data=grid_cells_na, fill = "orange") +
  geom_sf(data = mn_grid_tiles_sf, fill = NA, size = 2)

image

Checking the ranges of each variable

daily_summary <- munged_gcm_drivers %>% 
  select(-cell) %>% 
  group_by(time) %>% 
  summarize_all(list(min, mean, max), na.rm = TRUE) %>% 
  pivot_longer(-time) %>% 
  tidyr::separate(name, c("variable", "fxn"), sep = "_") %>% 
  mutate(fxn = case_when(
    fxn == "fn1" ~ "min", 
    fxn == "fn2" ~ "mean", 
    fxn == "fn3" ~ "max"))

ggplot(daily_summary, aes(time, value, color = fxn)) +
  geom_line() + 
  facet_grid(variable ~ ., scales = "free_y")
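A small aside on the fn1/fn2/fn3 recode above: if the functions are passed as a named list, dplyr uses those names as column suffixes, so the case_when() step can be dropped. A sketch with a tiny stand-in data frame (hypothetical values; with the real data this relies on the driver column names containing no underscores, as the TitleCase names here do not):

```r
library(dplyr)
library(tidyr)

# Tiny stand-in for munged_gcm_drivers, just to show the naming behavior
demo_drivers <- tibble(
  time = rep(as.Date(c("2000-01-01", "2000-01-02")), each = 2),
  cell = rep(1:2, 2),
  AirTemp = c(1, 3, 2, 4)
)

daily_summary <- demo_drivers %>%
  select(-cell) %>%
  group_by(time) %>%
  # Named list => columns like AirTemp_min, AirTemp_mean, AirTemp_max
  summarize_all(list(min = min, mean = mean, max = max), na.rm = TRUE) %>%
  pivot_longer(-time) %>%
  separate(name, c("variable", "fxn"), sep = "_")
```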

Reminder of units:

image

lindsayplatt commented 2 years ago

I have noted the remaining questions about this new GCM data (grid, dates, and missing cells) in the related issue #273. This code was approved prior to the download finishing (which it now has), so I am going to merge and allow those additional questions to be answered and potentially addressed in future PRs.

hcorson-dosch-usgs commented 2 years ago

@lindsayplatt - at the moment I'm seeing GLM failures in lakes that cover more than 4 cells, but it might be helpful for my debugging purposes to know which 4 cells have NA values anyhow -- do you have that info?

lindsayplatt commented 2 years ago

I am rerunning the code to get the cell numbers. Taking longer than I thought it would!

hcorson-dosch-usgs commented 2 years ago

Update - the four cells are c(17909, 18126, 18127, 18993). Thanks, Lindsay!

lindsayplatt commented 2 years ago

Meant to tell you that I added those into the code above when I sent them earlier! Thanks for making sure they exist here, too!