Closed: lindsayplatt closed this issue 2 years ago.
Here is the approach I am working through now: I'm using `data.table` for all of the major operations. The idea is to find cut points x such that, for example, the Notaro bin between 0 and x contains 3% of the Notaro observations.

[Figure: initial global distributions for NLDAS (red) and Notaro (black)]
Here is an updated distribution with the more resolved bins I am using to get better intersections on where the Notaro cuts need to be, with dates filtered to only those between '1981-01-01' and '2000-12-31', inclusive:

Note the large number of zeros.
Here are the calculated offsets for the Notaro dataset so that it aligns better with the distribution of NLDAS wind. This is based on bin cuts like this:

So we'd use `low_bound` with the `cut()` function, `group_by()` those cut bins, and apply the offset.
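To make that concrete, here is a minimal sketch of the cut-then-join-then-replace idea. The `low_bound`/`new_val` pairs below are invented for illustration; the real ones come from the distribution-matching step:

```r
library(dplyr)
library(tibble)

# Hypothetical transform table (values invented for illustration): each row's
# low_bound is the left edge of a bin and new_val is the replacement value
transform_tbl <- tibble(
  low_bound = c(0, 0.5, 1.0),
  new_val   = c(0.25, 1.05, 1.75)
)
bin_max <- 100  # well beyond any expected windspeed

wind <- tibble(WindSpeed = c(0.1, 0.5, 0.55, 1.2))

# same breaks for both tables so the factor levels line up for the join
breaks <- c(transform_tbl$low_bound, bin_max)

wind_adj <- wind %>%
  mutate(bin_id = cut(WindSpeed, breaks = breaks, right = FALSE)) %>%
  left_join(transform_tbl %>%
              mutate(bin_id = cut(low_bound, breaks = breaks, right = FALSE)),
            by = "bin_id") %>%
  mutate(WindSpeed = new_val) %>%   # 0.5 and 0.55 land in the same bin
  select(WindSpeed)
```

Note that the join key is the factor produced by `cut()`, so both tables must be cut with identical breaks.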
Wow those offsets are tricky to conceptualize visually. The method you laid out makes sense, but I don't think I'm tracking why the offset peaks at ~10 m/s windspeed. Does the steady increase in the offset up to that peak result from a need to flatten the peak of the GCM distribution? I suppose what's not captured in that figure is a sense of how many values fall within each cut bin.
Edited to add: whoops, that sent before I was done. I was also wondering: am I correct that the offset value is applied to each existing GCM value within each cut bin? E.g., a GCM value of 0.5 becomes 1.55 and a GCM value of 0.55 becomes 1.6? Or do all values within a given cut bin simply get reassigned to that `new_val` in your tibble, e.g., a GCM value of 0.5 becomes 1.05, and a GCM value of 0.55 also becomes 1.05? From the upper plot, I think it's the former, where the Notaro offset = `new_val` - `low_bound`, but I just wanted to make sure I'm understanding the transformation correctly.
This is what I'm imagining in my head:
> do all values within a given cut bin simply get reassigned to that `new_val` in your tibble, e.g., a GCM value of 0.5 becomes 1.05, and a GCM value of 0.55 becomes 1.05
This is the correct interpretation of what was done. Anything within the bin gets `new_val` regardless of where along the bin it falls. This is due to discretizing the approach (binning vs. solving for a continuous transform function).
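In other words, the transform is a step function. A tiny sketch with invented bin cuts and replacement values:

```r
# Hypothetical cut bins and per-bin replacement values (invented for illustration)
breaks  <- c(0, 0.5, 1.0, Inf)
new_val <- c(0.25, 1.05, 1.75)  # one replacement value per bin

bin_of <- function(x) findInterval(x, breaks)  # index of the bin containing x

new_val[bin_of(0.50)]  # 1.05
new_val[bin_of(0.55)]  # 1.05 -- same bin, so same replacement value
```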
As for the 10 m/s question, my interpretation is that ~11.5 m/s observations are very rare in the GCM data and somewhat more common in the NLDAS dataset. In order to increase the frequency of 11.5 m/s windspeeds in the GCM data to match NLDAS, the largest shift needed to be applied to GCM values around 10 m/s.
I have an update now with this shift applied to the entire GCM dataset, and then binning it to the same bins as NLDAS for a distribution (I assume the bumps are an artifact of the discretization):
and the cumulative distribution is identical to the eye (at least for me):
Ah, thank you for clarifying. That makes sense now. The new distribution looks good!
@lindsayplatt a heads up on the methods here. I used the GCM files on Caldera, but am 99.999% sure they had a 1.5 m/s offset applied to them. So I subtracted that from all files before doing this distribution work. If someone is applying these offsets directly to those data without first removing the 1.5 m/s offsets, things aren't going to work out. And if that offset wasn't exactly 1.5, then I'd want to repeat this.
Here is how I applied the offset to an individual file. The offset is in the `notaro_transform` data.frame (I took a screenshot of what that looks like above), and `bin_max` is 100 here, just meant to be way beyond the maximum expected value.
```r
library(data.table)

# convert the transform table to a data.table keyed by the same cut-bin factor;
# this setup step wasn't in the original snippet -- one way to construct it:
DT_offsets <- as.data.table(notaro_transform)
DT_offsets[, bin_id := cut(low_bound, breaks = c(low_bound, bin_max), right = FALSE)]

DT_wind <- data.table::fread(filepath)[time %between% date_range]
# note that we're SUBTRACTING THE OFFSET HERE!!!
DT_wind[, WindSpeed := WindSpeed - 1.5]
# this creates a factor that we can group by and apply the offset:
DT_wind[, bin_id := cut(WindSpeed, breaks = c(notaro_transform$low_bound, bin_max), right = FALSE)]
merged_offset_wind <- merge(DT_wind, DT_offsets, by = 'bin_id', all.x = TRUE)
merged_offset_wind[, WindSpeed := new_val]
merged_offset_wind[, c("low_bound", "new_val", "bin_id") := NULL]
```
@jordansread I think it may depend on what you mean by GCM files on Caldera. The ones in `lake-temperature-model-prep` would not have the +1.5 m/s offset, because I added that during the `lake-temperature-process-models` GCM munging step by adding `mutate(WindSpeed = WindSpeed + 1.5)` right below this line.
Yes @lindsayplatt, I used the .csv files on Caldera in `lake-temperature-process-models` for the Notaro GCM data, but used the .csv files from `lake-temperature-model-prep` for NLDAS for this work.
OK, then they are probably impacted, because I didn't think we needed to re-run anything in that pipeline until after your transform fix. I was thinking that fix would end up in `lake-temperature-model-prep` instead. I think we can easily rebuild the `p1_gcm_csvs` target in the `main` branch so that you can work with the CSVs without that offset, if you'd like.
There is some bad DRY (don't-repeat-yourself) stuff in here, but I didn't have enough time to address that. Here is how I created this distribution test:
```r
# bin NLDAS wind into 0.1 m/s bins, calc frequency for each bin **BUT**
# will need to subset for only overlapping periods
# **BUT** will also need to filter to only lakes that xwalk to both sources
# 7_drivers_munge/out/7_drivers_munged.feather has all of the .csv files
library(data.table)
library(progress)
library(tidyverse)

nldas_xwalk <- readRDS('7_config_merge/out/nml_meteo_fl_values.rds')
# from rsync -avz jread@caldera-dtn.cr.usgs.gov:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models/1_prep/in/lake_cell_tile_xwalk.csv ../lake_cell_tile_xwalk.csv
notaro_xwalk <- read_csv('/Volumes/ThunderBlade/lake_cell_tile_xwalk.csv')
merged_xwalk <- inner_join(nldas_xwalk, notaro_xwalk, by = 'site_id')

driver_files <- arrow::read_feather('7_drivers_munge/out/7_drivers_munged.feather') %>%
  mutate(meteo_fl = basename(driverpath)) %>%
  filter(meteo_fl %in% merged_xwalk$meteo_fl)

bin_w <- 0.1
bin_max <- 100
date_range <- c('1981-01-01', '2000-12-31')

# empty table of every possible NLDAS bin, used to accumulate counts across files
binned_nldas <- data.table(bin_id = cut(seq(0, to = bin_max - bin_w/10, by = bin_w),
                                        breaks = seq(0, to = bin_max, by = bin_w), right = FALSE),
                           bin_cnt = 0L)

pb <- progress_bar$new(total = length(driver_files$driverpath))
for (filepath in driver_files$driverpath){
  DT_wind <- data.table::fread(filepath)[time %between% date_range]
  DT_wind[, bin_id := cut(WindSpeed, breaks = seq(0, to = bin_max, by = bin_w), right = FALSE)]
  DT_cnt <- DT_wind[, .(bin_cnt = length(WindSpeed)), by = bin_id]
  # the merge will create NA for bins that this file doesn't have values in, replace w/ 0:
  merged_nldas <- merge(DT_cnt, binned_nldas, by = 'bin_id', all.y = TRUE)[is.na(bin_cnt.x), bin_cnt.x := 0]
  binned_nldas <- merged_nldas[, bin_cnt := bin_cnt.x + bin_cnt.y][, c('bin_id', 'bin_cnt'), with = FALSE]
  pb$tick()
}

# access GCM data, bin to 0.005 m/s, re-write as bin counts
# from folder,
# `rsync -avz --include '*/' --include 'GCM_GFDL_1981_2000_183*' --include 'GCM_MRI_1981_2000_183*' --include 'GCM_MIROC5_1981_2000_183*' --include 'GCM_CNRM_1981_2000_183*' --include 'GCM_ACCESS_1981_2000_183*' --include 'GCM_IPSL_1981_2000_183*' --exclude '*' jread@caldera-dtn.cr.usgs.gov:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models/1_prep/out/ ./`
# but now need to filter these!!
GCM_driver_files <- dir('/Volumes/ThunderBlade/Notaro_downscaled_GLM/', full.names = TRUE)

bin_w <- 0.005 # change as desired for higher sampling/precision of downscaled data; has an impact on bin alignment for the transform
binned_notaro <- data.table(bin_id = cut(seq(0, to = bin_max - bin_w/10, by = bin_w),
                                         breaks = seq(0, to = bin_max, by = bin_w), right = FALSE),
                            bin_cnt = 0L)

pb <- progress_bar$new(total = length(GCM_driver_files))
for (filepath in GCM_driver_files){
  DT_wind <- data.table::fread(filepath)[time %between% date_range]
  # note that we're SUBTRACTING THE OFFSET HERE!!!
  DT_wind[, WindSpeed := WindSpeed - 1.5]
  DT_wind[, bin_id := cut(WindSpeed, breaks = seq(0, to = bin_max, by = bin_w), right = FALSE)]
  DT_cnt <- DT_wind[, .(bin_cnt = length(WindSpeed)), by = bin_id]
  # the merge will create NA for bins that this file doesn't have values in, replace w/ 0:
  merged_notaro <- merge(DT_cnt, binned_notaro, by = 'bin_id', all.y = TRUE)[is.na(bin_cnt.x), bin_cnt.x := 0]
  binned_notaro <- merged_notaro[, bin_cnt := bin_cnt.x + bin_cnt.y][, c('bin_id', 'bin_cnt'), with = FALSE]
  pb$tick()
}

# empirical CDFs for both sources
binned_notaro$notaro_bin_total <- cumsum(binned_notaro$bin_cnt) / sum(binned_notaro$bin_cnt)
binned_ranked <- binned_nldas %>% mutate(nldas_bin_total = cumsum(bin_cnt) / sum(bin_cnt)) %>%
  # for each NLDAS bin, find the first Notaro bin at or beyond the same cumulative probability
  rowwise() %>% mutate(notaro_bin = which(binned_notaro$notaro_bin_total >= nldas_bin_total)[1]) %>%
  ungroup() %>% mutate(low_bound = 0, new_val = 0)

for (j in 2:nrow(binned_ranked)){
  # get the cut bin from the distribution we're transforming _from_
  lower_cut_text <- as.character(binned_notaro$bin_id)[binned_ranked$notaro_bin[j-1]]
  lower_bnd <- unlist(strsplit(gsub("(?![,.])[[:punct:]]", "", as.character(lower_cut_text), perl = TRUE), ","))[2] %>%
    as.numeric() %>% {. - bin_w}
  binned_ranked$low_bound[j] <- lower_bnd
  # get the middle of the cut bin from the distribution we're transforming _to_
  # this takes a factor level like "[0.1,0.2)" and calculates the mean of the bin edges, so you'd get 0.15 in this case:
  binned_ranked$new_val[j] <- unlist(strsplit(gsub("(?![,.])[[:punct:]]", "", as.character(binned_ranked$bin_id[j]), perl = TRUE), ","))[1:2] %>%
    as.numeric() %>% mean
}

notaro_transform <- binned_ranked %>%
  filter(bin_cnt > 0) %>% group_by(low_bound) %>% summarise(new_val = sum(new_val * bin_cnt) / sum(bin_cnt))
saveRDS(notaro_transform, '/Volumes/ThunderBlade/notaro_wind_transform.rds')
```
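The bin-label parsing in that final loop is dense, so here it is in isolation. It assumes the factor labels produced by `cut()` (e.g. `"[0.1,0.2)"`); the regex strips all punctuation except commas and decimal points:

```r
# cut() produces factor labels like "[0.1,0.2)"; remove the brackets/parens
# (keeping commas and decimal points), then split on the comma
label  <- "[0.1,0.2)"
bounds <- as.numeric(unlist(strsplit(gsub("(?![,.])[[:punct:]]", "", label, perl = TRUE), ",")))

bounds        # c(0.1, 0.2)
mean(bounds)  # 0.15, the bin midpoint used as new_val
```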
See above for how this can be applied to a file.

I know that you did this application using the `lake-temperature-process-models` GCM files, but I am wondering if we want to include this here instead? That way, the same transform exists for the driver files that Andy might use.
After a deep dive into the reasons behind some of the warm bias we were seeing in GCM-driven GLM output, we had a hunch that the lower windspeeds in the GCMs relative to NLDAS may be the issue. We did a test by adding 1.5 m/s to the windspeeds just to see how that would impact the outputs, and saw a significant change. These outcomes encouraged us to believe that correcting windspeeds would help with this bias.
Jordan has past experience with applying a transform function to do this statistical adjustment. The idea is that the distribution of GCM windspeeds in the contemporary time period should be similar to those in the NLDAS. Once that transform function is created, it can be applied to the future time periods.
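The general idea behind such a transform is quantile mapping: each GCM value is replaced by the NLDAS value at the same cumulative probability. A toy sketch on synthetic data (Weibull stand-ins, not the real driver files):

```r
# Toy quantile-mapping sketch (not the production code above): map each GCM
# windspeed to the NLDAS windspeed at the same cumulative probability.
set.seed(42)
nldas <- rweibull(1e4, shape = 2, scale = 5)  # stand-in for observed NLDAS wind
gcm   <- rweibull(1e4, shape = 2, scale = 4)  # stand-in for low-biased GCM wind

p <- ecdf(gcm)(gcm)  # cumulative probability of each GCM value in its own distribution
gcm_adj <- quantile(nldas, probs = p, names = FALSE)  # NLDAS value at that probability

# after mapping, the adjusted GCM distribution closely matches NLDAS
abs(mean(gcm_adj) - mean(nldas))
```

Because the mapping is fit on the contemporary period, the same function can then be applied to future-period GCM values, which is the plan described above.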