DOI-USGS / lake-temperature-model-prep


Implement wind transform function for GCMs #353

Closed lindsayplatt closed 2 years ago

lindsayplatt commented 2 years ago

After a deep dive into the reasons behind some of the warm bias we were seeing in GCM-driven GLM output, we had a hunch that the issue may be that windspeeds in the GCMs are lower than those in NLDAS. As a test, we added 1.5 m/s to the GCM windspeeds just to see how that would affect the GLM output, and we saw a significant change. These outcomes have encouraged us that correcting windspeeds would help with this bias.

Jordan has past experience with applying a transform function to do this statistical adjustment. The idea is that the distribution of GCM windspeeds in the contemporary time period should be similar to the NLDAS distribution. Once that transform function is created, it can be applied to the future time periods.
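
As a general illustration of this kind of statistical adjustment (a sketch only, not necessarily the exact method; the thread below settles on a discretized, binned variant), empirical quantile mapping matches quantiles of the two contemporary-period distributions. Here gcm_wind, nldas_wind, and gcm_wind_future are hypothetical vectors of daily windspeeds:

# empirical quantiles of the two contemporary-period distributions
probs <- seq(0, 1, by = 0.001)
q_gcm <- quantile(gcm_wind, probs = probs, names = FALSE)
q_nldas <- quantile(nldas_wind, probs = probs, names = FALSE)

# map a GCM windspeed to the NLDAS windspeed at the same quantile
transform_wind <- approxfun(q_gcm, q_nldas, rule = 2, ties = mean)

# the same transform can then be applied to future-period GCM windspeeds
gcm_wind_future_adj <- transform_wind(gcm_wind_future)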

jordansread commented 2 years ago

Here is the approach I am working through now:

  1. Use all of the cell-based files as written for GLM (so these will be csvs). This avoids using any cells in the raw data that aren't used in modeling, and it also ensures an apples-to-apples comparison since, at least for NLDAS, the raw data is hourly while the GLM data is daily. Because csvs are slow, I'm using data.table for all of the heavy lifting here.
  2. Filter down to only the lake-cell combos that appear in both datasets. NLDAS has a much larger footprint than the Notaro dataset (e.g., includes DRB) and those files were also written for all lakes, not just the ones that are model-able.
  3. Filter the data down to only the overlapping time periods.
  4. Create a global distribution of wind speeds for comparison and transformation, meaning one distribution for all space, all time, and all GCMs. This avoids location-specific or GCM-specific bias shifting that may remove important differences between GCMs.
  5. I was originally going to fit these to distributions and do the shift that way, but input from Andy has me convinced that it is safer to do a discrete statistical transform (see 6.)
  6. I'm binning NLDAS into 0.1 m/s intervals (this is a parameter, so it can be changed as needed). I think this is a fine enough resolution for values going into the model. Each 0.1 bin will have a count of values (a discrete frequency distribution), and the job of the transform is to figure out the width of each Notaro bin necessary to contain the same frequency of wind values as the corresponding NLDAS bin. So if there are 100 total values (there are actually way more) and NLDAS has 3 of them between 0 and 0.1, we want to find the value x such that the Notaro bin between 0 and x contains 3% of the Notaro observations (see the sketch after this list).
  7. Creating the shifted bin locations then gives us a bin-specific transform to apply to each Notaro value.
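
A minimal sketch of the matching in step 6, using hypothetical vectors nldas_wind and notaro_wind of daily values (the working implementation further down does this with binned counts accumulated across many files, matching cumulative fractions bin by bin):

# fraction of NLDAS windspeeds falling in the first 0.1 m/s bin
nldas_frac <- mean(nldas_wind >= 0 & nldas_wind < 0.1)

# find the cut x so that the Notaro bin [0, x) holds that same fraction of values;
# the empirical quantile of the Notaro distribution at that fraction gives the cut
x_cut <- quantile(notaro_wind, probs = nldas_frac, names = FALSE)

# Notaro values in [0, x_cut) then map to the 0-0.1 m/s NLDAS bin; later bins
# are found the same way using cumulative fractions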

Initial global distributions for NLDAS (red) and Notaro (black)

image

jordansread commented 2 years ago

Here is an updated distribution using the finer bins I'm now working with to get better intersections for where the Notaro cuts need to be, with the dates filtered to only those between '1981-01-01' and '2000-12-31', inclusive:

image

Note the large number of zeros

jordansread commented 2 years ago

Here are the calculated offsets for the Notaro dataset, so that it aligns better with the distribution of NLDAS wind:

image

This is based on bin cuts like this:

image

So we'd use low_bound with the cut() function, group_by those cut bins, and apply the offset.

hcorson-dosch-usgs commented 2 years ago

Wow those offsets are tricky to conceptualize visually. The method you laid out makes sense, but I don't think I'm tracking why the offset peaks at ~10 m/s windspeed. Does the steady increase in the offset up to that peak result from a need to flatten the peak of the GCM distribution? I suppose what's not captured in that figure is a sense of how many values fall within each cut bin.

Edited to add: whoops, that sent before I was done. I was also wondering - am I correct that the offset value is applied to each existing GCM value within each cut bin? e.g., a GCM value of 0.5 becomes 1.55 and a GCM value of 0.55 becomes 1.6? Or do all values within a given cut bin simply get reassigned to that new_val in your tibble, e.g., a GCM value of 0.5 becomes 1.05, and a GCM value of 0.55 becomes 1.05? From the upper plot, I think it's the former, where Notaro offset = new_val - low_bound, but I just wanted to make sure I'm understanding the transformation correctly.

This is what I'm imagining in my head: image

jordansread commented 2 years ago

> do all values within a given cut bin simply get reassigned to that new_val in your tibble, e.g., a GCM value of 0.5 becomes 1.05, and a GCM value of 0.55 becomes 1.05

This is the correct interpretation of what was done. Anything within the bin gets new_val regardless of where along the bin it falls. This is due to discretizing the approach (binning vs solving for a continuous transform function).
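
A toy numeric version of that reassignment, using one hypothetical row of the transform table (low_bound of 0.5, i.e., the bin [0.5, 0.6), with a new_val of 1.05):

wind <- c(0.50, 0.55, 0.59)
bin_id <- cut(wind, breaks = c(0.5, 0.6), right = FALSE)
# every value falling in the bin is reassigned to that bin's new_val,
# regardless of where within the bin it fell
wind_adjusted <- ifelse(!is.na(bin_id), 1.05, wind)
# wind_adjusted is c(1.05, 1.05, 1.05)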

As for the 10 m/s question, my interpretation is that ~11.5 m/s observations are very rare in the GCM data and somewhat more common in the NLDAS dataset. In order to increase the frequency of ~11.5 m/s windspeeds in the GCM data to match NLDAS, the largest shift needed to be applied to GCM values around 10 m/s.

I now have an update with this shift applied to the entire GCM dataset, which I then binned to the same bins as NLDAS for a distribution (I assume the bumps are an artifact of the discretization):

image

and the cumulative distributions are identical to the eye (at least to mine):

image

hcorson-dosch-usgs commented 2 years ago

Ah, thank you for clarifying. That makes sense now. The new distribution looks good!

jordansread commented 2 years ago

@lindsayplatt a heads up on the methods here. I used the GCM files on Caldera, but am 99.999% sure they had a 1.5 m/s offset applied to them. So I subtracted that from all files before doing this distribution work. If someone is applying these offsets directly to those data without first removing the 1.5 m/s offsets, things aren't going to work out. And if that offset wasn't exactly 1.5, then I'd want to repeat this.

Here is how I applied the offset to an individual file. The offsets are in the notaro_transform data.frame (I took a screenshot of what that looks like above). bin_max is 100 here and is just meant to be well beyond the maximum expected windspeed.

DT_wind <- data.table::fread(filepath)[time %between% date_range]
# note that we're SUBTRACTING THE OFFSET HERE!!!
DT_wind[, WindSpeed :=  WindSpeed - 1.5]

# this creates a factor that we can group by and apply the offset:
DT_wind[, bin_id := cut(WindSpeed, breaks = c(notaro_transform$low_bound, bin_max), right = FALSE)]
merged_offset_wind <- merge(DT_wind, DT_offsets, by = 'bin_id', all.x = TRUE)
merged_offset_wind[, WindSpeed := new_val]
merged_offset_wind[, c("low_bound", "new_val", "bin_id") := NULL]
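
Note that DT_offsets isn't defined in the snippet above; it presumably holds the notaro_transform values keyed by the same bin_id factor that the cut() call produces. A sketch of how that lookup table could be built (an assumed reconstruction, not shown here originally):

# assumed reconstruction: attach a bin_id to notaro_transform that matches the cut() levels above
DT_offsets <- data.table::as.data.table(notaro_transform)
DT_offsets[, bin_id := cut(low_bound, breaks = c(notaro_transform$low_bound, bin_max), right = FALSE)]

The merge then attaches low_bound and new_val to every row of wind data, which is why those helper columns are dropped at the end.
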
lindsayplatt commented 2 years ago

@jordansread I think it may depend on what you mean by GCM files on Caldera. The ones in lake-temperature-model-prep would not have the +1.5 m/s offset because I added that in during the lake-temperature-process-models GCM munging step by adding mutate(WindSpeed = WindSpeed + 1.5) right below this line.

jordansread commented 2 years ago

Yes @lindsayplatt, I used the .csv files on Caldera in lake-temperature-process-models for the Notaro GCM data, but used the .csv files from lake-temperature-model-prep for NLDAS for this work.

lindsayplatt commented 2 years ago

OK, then they are probably impacted, because I didn't think we needed to re-run anything in that pipeline until after your transform fix. I was thinking that fix would end up in lake-temperature-model-prep instead. I think we can easily rebuild the p1_gcm_csvs target in the main branch so that you can work with the CSVs without that offset, if you'd like.

jordansread commented 2 years ago

There is some code in here that violates DRY, but I didn't have enough time to address that. Here is how I created this distribution test:

# bin NLDAS wind into 0.1 m/s bins and calc the frequency for each bin, BUT
# we will need to subset to only the overlapping time periods

# and we will also need to filter to only the lakes that xwalk to both sources

# 7_drivers_munge/out/7_drivers_munged.feather lists all of the .csv driver files

library(data.table)
library(progress)
library(tidyverse)

nldas_xwalk <- readRDS('7_config_merge/out/nml_meteo_fl_values.rds')
# from rsync -avz  jread@caldera-dtn.cr.usgs.gov:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models/1_prep/in/lake_cell_tile_xwalk.csv ../lake_cell_tile_xwalk.csv
notaro_xwalk <- read_csv('/Volumes/ThunderBlade/lake_cell_tile_xwalk.csv')

merged_xwalk <- inner_join(nldas_xwalk, notaro_xwalk, by = 'site_id')

driver_files <- arrow::read_feather('7_drivers_munge/out/7_drivers_munged.feather') %>% 
  mutate(meteo_fl = basename(driverpath)) %>% 
  filter(meteo_fl %in% merged_xwalk$meteo_fl)

bin_w <- 0.1
bin_max <- 100
date_range <- c('1981-01-01', '2000-12-31')

binned_nldas <- data.table(bin_id = cut(seq(0, to = bin_max - bin_w/10, by = bin_w), 
                                        breaks = seq(0, to = bin_max, by = bin_w), right = FALSE),
                       bin_cnt = 0L)

pb <- progress_bar$new(total = length(driver_files$driverpath))

for (filepath in driver_files$driverpath){
  DT_wind <- data.table::fread(filepath)[time %between% date_range]

  DT_wind[, bin_id := cut(WindSpeed, breaks = seq(0, to = bin_max, by = bin_w), right = FALSE)]
  DT_cnt <- DT_wind[ , .(bin_cnt = length(WindSpeed)), by = bin_id]

  # the merge will create NA for bins that this file doesn't have values in, replace w/ 0:
  merged_nldas <- merge(DT_cnt, binned_nldas, by = 'bin_id', all.y = TRUE)[is.na(bin_cnt.x), bin_cnt.x := 0]
  binned_nldas <- merged_nldas[, bin_cnt := bin_cnt.x + bin_cnt.y][, c('bin_id','bin_cnt'), with =FALSE]
  pb$tick()
}

# access GCM data, bin to a finer resolution (bin_w = 0.005 below), re-write as bin counts

# from folder, 
# `rsync -avz --include '*/' --include 'GCM_GFDL_1981_2000_183*' --include 'GCM_MRI_1981_2000_183*' --include 'GCM_MIROC5_1981_2000_183*' --include 'GCM_CNRM_1981_2000_183*' --include 'GCM_ACCESS_1981_2000_183*' --include 'GCM_IPSL_1981_2000_183*' --exclude '*' jread@caldera-dtn.cr.usgs.gov:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models/1_prep/out/ ./`

# but now need to filter these!!
GCM_driver_files <- dir('/Volumes/ThunderBlade/Notaro_downscaled_GLM/', full.names = TRUE)
bin_w <- 0.005 # change as desired for higher sampling/precision of downscaled data, has impact on bin alignment for transform

binned_notaro <- data.table(bin_id = cut(seq(0, to = bin_max - bin_w/10, by = bin_w), 
                                         breaks = seq(0, to = bin_max, by = bin_w), right = FALSE),
                            bin_cnt = 0L)

pb <- progress_bar$new(total = length(GCM_driver_files))

for (filepath in GCM_driver_files){

  DT_wind <- data.table::fread(filepath)[time %between% date_range]
  # note that we're SUBTRACTING THE OFFSET HERE!!!
  DT_wind[, WindSpeed :=  WindSpeed - 1.5]

  DT_wind[, bin_id := cut(WindSpeed, breaks = seq(0, to = bin_max, by = bin_w), right = FALSE)]
  DT_cnt <- DT_wind[ , .(bin_cnt = length(WindSpeed)), by = bin_id]

  # the merge will create NA for bins that this file doesn't have values in, replace w/ 0:
  merged_notaro <- merge(DT_cnt, binned_notaro, by = 'bin_id', all.y = TRUE)[is.na(bin_cnt.x), bin_cnt.x := 0]
  binned_notaro <- merged_notaro[, bin_cnt := bin_cnt.x + bin_cnt.y][, c('bin_id','bin_cnt'), with =FALSE]
  pb$tick()
}

binned_notaro$notaro_bin_total <- cumsum(binned_notaro$bin_cnt)/sum(binned_notaro$bin_cnt)

binned_ranked <- binned_nldas %>% mutate(nldas_bin_total = cumsum(bin_cnt)/sum(bin_cnt)) %>% 
  rowwise() %>% mutate(notaro_bin = which(binned_notaro$notaro_bin_total >= nldas_bin_total)[1]) %>% 
  ungroup() %>% mutate(low_bound = 0, new_val = 0)

for (j in 2:nrow(binned_ranked)){

  # get the cut bin from the distribution we're transforming _from_ (the Notaro bins)
  lower_cut_text <- as.character(binned_notaro$bin_id)[binned_ranked$notaro_bin[j-1]]
  # strip the brackets from a factor label like "[0.1,0.105)", take the upper edge,
  # and subtract the Notaro bin width (bin_w) to recover the bin's lower edge
  lower_bnd <- unlist(strsplit(gsub("(?![,.])[[:punct:]]", "", as.character(lower_cut_text), perl=TRUE), ","))[2] %>% 
    as.numeric() %>% {. - bin_w}

  binned_ranked$low_bound[j] <- lower_bnd

  # get the middle of the cut bin from the distribution we're transforming _to_
  # this code takes a factor value like "[0.1,0.2)" and calculates the mean of the bin edges, so you'd get 0.15 in this case:
  binned_ranked$new_val[j] <- unlist(strsplit(gsub("(?![,.])[[:punct:]]", "", as.character(binned_ranked$bin_id[j]), perl=TRUE), ","))[1:2] %>% 
    as.numeric() %>% mean
}

notaro_transform <- binned_ranked %>%
  filter(bin_cnt > 0) %>% group_by(low_bound) %>% summarise(new_val = sum(new_val * bin_cnt) / sum(bin_cnt))

saveRDS(notaro_transform, '/Volumes/ThunderBlade/notaro_wind_transform.rds')
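
For reference, the kind of CDF comparison shown in the figures above could be sketched from these binned counts (this comparison isn't part of the script; it assumes the 0.1 and 0.005 m/s bin widths used above and approximates each bin by its midpoint):

# cumulative fraction for each source, with bins approximated by their midpoints
cdf_df <- bind_rows(
  as_tibble(binned_nldas) %>%
    transmute(source = 'NLDAS',
              wind = (seq_along(bin_cnt) - 0.5) * 0.1,
              cdf = cumsum(bin_cnt) / sum(bin_cnt)),
  as_tibble(binned_notaro) %>%
    transmute(source = 'Notaro (GCM, 1.5 m/s offset removed)',
              wind = (seq_along(bin_cnt) - 0.5) * 0.005,
              cdf = cumsum(bin_cnt) / sum(bin_cnt))
)

ggplot(cdf_df, aes(wind, cdf, color = source)) +
  geom_line() +
  coord_cartesian(xlim = c(0, 20)) +
  labs(x = 'Wind speed (m/s)', y = 'Cumulative fraction')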

See above for how this can be applied to a file.

lindsayplatt commented 2 years ago

I know that you did this application using the lake-temperature-process-models GCM files, but I am wondering if we want to include this here instead? That way, the same transform exists for the driver files that Andy might use?