Does the new grid correspond to the extent of this (from here), where the old version was a subset of it? Or is the new grid even bigger?
Having some weirdly failing queries. When I ran it myself, the error came up during the `wait()` call, but when you then check the `statusType`, it is succeeded. @jread-usgs, do you recognize this error? I can see the "Process Accepted" message, and then it fails when it gets to the `wait()` call and causes my whole target to fail.
```
> wait(gcm_job)
  [=====================================================================] 100%
Error in pb_update(self, private, ratio, tokens) :
  !self$finished is not TRUE
> gcm_job_status <- check(gcm_job)$statusType
> check(gcm_job)$statusType
[1] "ProcessSucceeded"
> my_data <- result(gcm_job, with.units = TRUE) # This works to load! It is 131,400 rows with 74 columns
```
^ @jread-usgs do you have any thoughts about the error I am seeing?
I think it is related to this: https://github.com/USGS-R/geoknife/issues/366
But it shouldn't have any impact on the results; as far as I understand it, this is just an error in checking how far along the job is. It looks like you still got data, which wouldn't happen if the job had failed.
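If that progress-bar error is the only thing killing your target, one possible workaround (just an untested sketch, not an official geoknife pattern) is to catch that specific error and rely on the status check afterward:

```r
# Untested sketch: tolerate the cosmetic progress-bar error from wait(),
# then confirm the job actually succeeded before pulling results
tryCatch(wait(gcm_job), error = function(e) {
  message("wait() errored (likely the progress bar issue): ", conditionMessage(e))
})
stopifnot(check(gcm_job)$statusType == "ProcessSucceeded")
my_data <- result(gcm_job, with.units = TRUE)
```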
Ah, so I just need to update for now so it doesn't throw that error 👍
I'm assuming you already have that update in your pkg, but perhaps you are seeing another instance of that progress bar issue?
Oh, you are right. I did have the update. I ended up adding `gconfig('sleep.time' = 45, 'wait' = TRUE)` before my call and that got around the error. Should I use that, or is it too temporary of a solution?
Yeah, but by changing your sleep to 45 secs you are going to reduce the number of times you ping GDP and thereby reduce the chance of hitting this error. I assumed you were already doing `wait`, but calling it within the `geoknife` function instead of using the config.
I was running this:
```r
gcm_job <- geoknife(
  stencil = query_simplegeom,
  fabric = webdata(
    url = query_url,
    variables = c(
      'windspeed_debias',
      'tas_debias',   # air temp
      'rsds_debias',  # shortwave radiation
      'rsdl_debias',  # longwave radiation
      'rh_debias',    # relative humidity
      'prcp_debias'), # precip
    times = c('1980-01-01', '2100-01-01')
  )
  # The default knife algorithm is appropriate
)
wait(gcm_job)
gcm_job_status <- check(gcm_job)$statusType
```
So doing the `wait()` right after? Sorry, I'm confused about what the solution to getting past this is.
Basically `geoknife(..., wait = TRUE)`, or a simple example:

```r
job <- geoknife(stencil = c(-89, 42), fabric = 'prism', wait = TRUE)
```
But setting the config value modifies whether wait is TRUE or FALSE by default.
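Put differently, these two forms should be equivalent (a minimal sketch; `gconfig()` just sets the package-level defaults that `geoknife()` picks up):

```r
# Option 1: pass wait directly to the process call
job <- geoknife(stencil = c(-89, 42), fabric = 'prism', wait = TRUE)

# Option 2: set the package-level defaults once, then call geoknife() as usual
gconfig('wait' = TRUE, 'sleep.time' = 45)
job <- geoknife(stencil = c(-89, 42), fabric = 'prism')
```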
Hmm, I added `wait = TRUE` as an argument but still get the error. I think I could benefit from a short call to understand what is going on and what needs to be done to get past it. I'll ping you on Teams.
Still waiting on my download to finish all the branches, but it has been marching along just fine so far: 30 out of 54 are complete. The only piece left is to verify the grid. I have added that as an item to #273, so that we can do it before calling our switch to the new GCMs complete. In order to get some of these grid and query (MN) changes to @hcorson-dosch for continued work on the NetCDF, we are going to merge this PR.
OK, the download finished!!! 🎉🎉🎉 Reminder that this download was for 9 tiles covering all of the MN lakes that we have in this pipeline so far. Each query was for all 3 of the projection time periods at once (21,900 days per query) and there were 6 GCMs (that's 54 queries in total!). Before sending to Hayley, I spent some time exploring. There are a couple of flags that I see:
```r
# These libraries are needed for the rest of the code chunks
library(targets)
library(sf)
library(tidyverse)

tar_load(lake_cell_xwalk_df)
tar_load(grid_cells_sf)
tar_load(grid_tiles_sf)
tar_load(query_cells)
tar_load(query_tiles)

# Map the number of lakes per grid cell within the MN query tiles
lakes_per_cell <- lake_cell_xwalk_df %>% group_by(cell_no) %>% summarize(nlakes = n())
mn_grid_tiles_sf <- grid_tiles_sf %>% filter(tile_no %in% query_tiles)
mn_grid_cells <- grid_cells_sf %>%
  filter(cell_no %in% query_cells) %>%
  left_join(lakes_per_cell, by = "cell_no")

ggplot() +
  geom_sf(data = mn_grid_cells, aes(fill = nlakes)) +
  scico::scale_fill_scico(palette = "batlow", direction = -1) +
  geom_sf(data = mn_grid_tiles_sf, fill = NA, size = 2)
```
Using the same function from here, I generated the following diagnostics for run time and file size. In total, it took 16 hours to download all the GCM data for all of MN. That seems OK, but not amazing: not ideal as we scale, but I think we can move this part to HPC to make it less painful. It was painful this time around only because I had an issue with the VPN connection timing out overnight and with keeping my computer from going to sleep. Running on HPC would fix both of those things.
```r
# How long did the GDP download take and how big are the resulting files?
tar_load(gcm_data_raw_feather)
target_metadata_download <- summarize_target_build(names(gcm_data_raw_feather))
```

```
-- DURATION --
Shortest: 4.021 min
Longest: 44.123 min
Total: 988.789 min

-- FILE SIZES --
Smallest: 4.377 mb
Biggest: 95.938 mb
Total: 2116.269 mb
```
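(For anyone following along: `summarize_target_build()` is a project helper, not a targets function. A hypothetical sketch of what it might look like, assuming it summarizes the `seconds` and `bytes` fields that targets records in its metadata store:)

```r
# Hypothetical sketch of the project helper (the real one lives in this repo):
# summarize build durations and file sizes from the targets metadata store
summarize_target_build <- function(target_names) {
  meta <- targets::tar_meta(names = tidyselect::any_of(target_names),
                            fields = tidyselect::any_of(c("seconds", "bytes")))
  cat(sprintf("-- DURATION --\nShortest: %.3f min\nLongest: %.3f min\nTotal: %.3f min\n\n",
              min(meta$seconds) / 60, max(meta$seconds) / 60, sum(meta$seconds) / 60))
  cat(sprintf("-- FILE SIZES --\nSmallest: %.3f mb\nBiggest: %.3f mb\nTotal: %.3f mb\n",
              min(meta$bytes) / 1e6, max(meta$bytes) / 1e6, sum(meta$bytes) / 1e6))
  invisible(meta)
}
```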
```r
# Load and combine ALL downloaded GCMs (6 GCMs, 9 tiles w/ varying number of cells)
tar_load(glm_ready_gcm_data_feather)
munged_gcm_drivers <- purrr::map(glm_ready_gcm_data_feather, arrow::read_feather) %>% bind_rows()

# Range of each chunk:
munged_gcm_drivers %>% filter(time < as.Date('2030-01-01')) %>% pull(time) %>% range()
munged_gcm_drivers %>% filter(time > as.Date('2030-01-01'),
                              time < as.Date('2070-01-01')) %>% pull(time) %>% range()
munged_gcm_drivers %>% filter(time > as.Date('2070-01-01')) %>% pull(time) %>% range()
```

```
[1] "1981-01-01" "2000-12-30"
[1] "2040-01-01" "2059-12-30"
[1] "2080-01-01" "2099-12-30"
```

The first flag: even though we queried 1980-01-01 through 2100-01-01, the data actually spans three 20-year chunks (1981-2000, 2040-2059, and 2080-2099).
The second flag is missing data:

```r
# Find the rows where any driver variable is NA
munged_gcm_nas <- munged_gcm_drivers %>%
  mutate(across(.cols = matches("[A-Z]{1}[a-z]+", ignore.case = F), .fns = is.na)) %>%
  rowid_to_column() %>%
  filter_at(.vars = vars(-rowid, -time, -cell), .vars_predicate = any_vars(.))

# 4 cells account for all NAs when looking at data from all 6 GCMs:
# 17909, 18126, 18127, 18993
na_cells <- munged_gcm_nas %>%
  group_by(cell) %>%
  summarize(count = n()) %>%
  pull(cell)
```
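It might also be worth checking which variables are NA within those cells; a quick sketch using the logical columns already in `munged_gcm_nas`:

```r
# Count NA days per cell for each driver variable
# (the driver columns were converted to TRUE/FALSE NA flags above)
munged_gcm_nas %>%
  group_by(cell) %>%
  summarize(across(.cols = -c(rowid, time), .fns = sum))
```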
```r
# Where are these cells?
tar_load(grid_cells_sf)
tar_load(grid_tiles_sf)
tar_load(query_cells)
tar_load(query_tiles)

mn_grid_tiles_sf <- grid_tiles_sf %>% filter(tile_no %in% query_tiles)
mn_grid_cells_sf <- grid_cells_sf %>% filter(cell_no %in% query_cells)
grid_cells_na <- grid_cells_sf %>% filter(cell_no %in% na_cells)

ggplot() +
  geom_sf(data = mn_grid_cells_sf) +
  geom_sf(data = grid_cells_na, fill = "orange") +
  geom_sf(data = mn_grid_tiles_sf, fill = NA, size = 2)
```
```r
# Daily min, mean, and max of each variable across all cells and GCMs
# (unnamed functions in summarize_all() get the suffixes fn1, fn2, fn3)
daily_summary <- munged_gcm_drivers %>%
  select(-cell) %>%
  group_by(time) %>%
  summarize_all(list(min, mean, max), na.rm = TRUE) %>%
  pivot_longer(-time) %>%
  tidyr::separate(name, c("variable", "fxn"), sep = "_") %>%
  mutate(fxn = case_when(
    fxn == "fn1" ~ "min",
    fxn == "fn2" ~ "mean",
    fxn == "fn3" ~ "max"))

ggplot(daily_summary, aes(time, value, color = fxn)) +
  geom_line() +
  facet_grid(variable ~ ., scales = "free_y")
```
Reminder of units:
I have noted the remaining questions about this new GCM data (grid, dates, and missing cells) in the related issue #273. This code was approved prior to the download finishing (which it now has), so I am going to merge and allow those additional questions to be answered and potentially addressed in future PRs.
@lindsayplatt - at the moment I'm seeing GLM failures in lakes that cover more than 4 cells, but it might be helpful for my debugging purposes to know which 4 cells have NA values anyhow -- do you have that info?
I am rerunning the code to get the cell numbers. Taking longer than I thought it would!
Update: the four cells are `c(17909, 18126, 18127, 18993)`.
Thanks, Lindsay!
Meant to tell you that I added those into the code above when I sent them earlier! Thanks for making sure they exist here, too!
Scaling up to MN will require one additional step, which is to have a more robust way to subset our lakes to MN than what I did here. I was able to test this with those 5 specific NML sites (see this commit) and it worked (see the notes here). Now I am running this for our new
This addresses #273 and #258. Leaving as a draft until my `tar_make(glm_ready_gcm_data_feather)` command finishes. Note that this pull request includes the results following a `git pull origin main`, which is why there are so many commits and changes. Since we have kept the GCM pipeline in a separate branch, we did this step to get the updates from `main` into this branch. This commit is really the first commit of new changes related to the GCM work.