DOI-USGS / lake-temperature-process-models

Creative Commons Zero v1.0 Universal

Decide on binning/interpolation approach for matching predictions to observations #59

Closed hcorson-dosch-usgs closed 2 years ago

hcorson-dosch-usgs commented 2 years ago

See Jordan's note on the NLDAS surface evaluation PR here

Our team is currently using two approaches to match predictions to observations:

1. Interpolate predictions to the depths of observations. The `5_evaluate` portion of this pipeline uses this method, based on Alison's depth-matching code that was in the mntoha data release.
2. Assign each observation to the nearest depth bin and match predictions to observations based on those depth bins. This is the approach used by Andy in the lake-temperature-lstm-static repo.

Once we decide on an approach, the same approach should be used across both workflows.
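A minimal R sketch of the two approaches, using toy depths and temperatures (variable names are mine, not the pipeline's):

```r
# Predictions at fixed model depths; observations at arbitrary field depths
pred_depths <- c(0, 0.5, 1, 1.5, 2)             # model output depths (m)
pred_temps  <- c(24.1, 23.8, 23.2, 22.5, 21.9)  # predicted temperature (deg C)
obs_depths  <- c(0.3, 1.2)                       # depths of field observations

# Approach 1: linearly interpolate predictions to the observation depths
interp_preds <- approx(pred_depths, pred_temps, xout = obs_depths)$y

# Approach 2: snap each observation to its nearest prediction depth (bin),
# then take the prediction at that bin
nearest_bin  <- sapply(obs_depths, function(d) pred_depths[which.min(abs(d - pred_depths))])
binned_preds <- pred_temps[match(nearest_bin, pred_depths)]

interp_preds  # interpolated values at 0.3 m and 1.2 m
binned_preds  # predictions at the nearest bins (0.5 m and 1 m)
```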

AndyMcAliley commented 2 years ago

After discussion, Hayley and I decided to take approach 1: interpolating predictions to match observation depths. Here's why:

  1. The training set for the LSTM is formed by assigning observations to depth bins (nearest-neighbor interpolation). If we evaluated the model by interpolating observations in the same way, then the model evaluation wouldn't account for the error introduced by that interpolation. A model evaluation using approach 1, on the other hand, will include that extra error. It's more rigorous to keep model training and model evaluation methods independent.
  2. The "Con" for approach 1 above doesn't actually apply: it only requires 1 observation on a given date because predictions are interpolated, not observations.
  3. To me, it makes more sense to evaluate a model's ability to reproduce observations as they are, not interpolated observations.
lindsayplatt commented 2 years ago

As shared via chat from Hayley on 7/15, she has changed the method in a branch on her fork here: https://github.com/hcorson-dosch/lake-temperature-process-models/blob/extrapolate_preds/5_evaluate/src/eval_utility_fxns.R#L61-L77. I ran `tar_make_clustermq(p5_nldas_pred_obs_csv, workers = 50)` on her hayley_extrapolate_preds branch on Tallgrass today (it took ~2 hrs to run). Then I compared the previous matching output, `5_evaluate/out/NLDAS_matched_to_observations_PredsNotExtrapolated.csv`, to the most recent build. The following are the differences:

# Compare the matching preds to obs methods.
library(tidyverse)

match_old <- readr::read_csv('5_evaluate/out/NLDAS_matched_to_observations_PredsNotExtrapolated.csv')
match_new <- readr::read_csv('5_evaluate/out/NLDAS_matched_to_observations.csv')

# How many more observations are kept?
nrow(match_new) - nrow(match_old)
[1] 19372

match_old_site <- match_old %>% group_by(site_id) %>% summarize(n_old = n())
match_new_site <- match_new %>% group_by(site_id) %>% summarize(n_new = n())

match_compare_site <- full_join(match_old_site, match_new_site, by = "site_id") %>% 
  mutate(diff = n_new - n_old) %>% 
  filter(diff != 0)

# Across how many sites were the new observations spread? 
# For context, there were 3688 sites total
nrow(match_compare_site) 
[1] 1609

# On average, there are 12 new observations per site
summary(match_compare_site)
   site_id              n_old             n_new            diff       
 Length:1609        Min.   :    7.0   Min.   :   12   Min.   :  1.00  
 Class :character   1st Qu.:  118.0   1st Qu.:  123   1st Qu.:  2.00  
 Mode  :character   Median :  247.0   Median :  255   Median :  4.00  
                    Mean   :  793.9   Mean   :  806   Mean   : 12.04  
                    3rd Qu.:  647.0   3rd Qu.:  671   3rd Qu.:  9.00  
                    Max.   :63268.0   Max.   :63271   Max.   :838.00 

# What's that site with a crazy number of additional new obs?
match_compare_site %>% filter(diff > 700)
# A tibble: 1 × 4
  site_id        n_old n_new  diff
  <chr>          <int> <int> <int>
1 nhdhr_45385910  6675  7513   838

plot(match_compare_site$diff)

*(attached image: base R scatter plot of `diff` for each site)*
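To illustrate why extrapolating predictions recovers matches (my sketch of the general mechanism, with made-up values, not the actual code in `eval_utility_fxns.R`): with plain interpolation, an observation deeper than the deepest prediction gets `NA` and drops out of the matched set, whereas extrapolation (`rule = 2` in `approx()`) carries the nearest prediction to that depth.

```r
pred_depths <- c(0, 1, 2)    # model output depths (m)
pred_temps  <- c(24, 23, 22) # predicted temperature (deg C)
obs_depths  <- c(0.5, 2.6)   # 2.6 m is below the deepest prediction

# rule = 1: NA outside the predicted depth range, so that row is dropped
no_extrap <- approx(pred_depths, pred_temps, xout = obs_depths, rule = 1)$y
# rule = 2: hold the nearest predicted value constant beyond the range
extrap    <- approx(pred_depths, pred_temps, xout = obs_depths, rule = 2)$y

no_extrap  # second value is NA -> the 2.6 m observation would be unmatched
extrap     # second value is 22 -> the 2.6 m observation is retained
```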

hcorson-dosch-usgs commented 2 years ago

Thanks so much for wrapping up that work and writing this summary, Lindsay! Should I go ahead and PR that branch, so that this method is used when we complete the updated GCM runs?

lindsayplatt commented 2 years ago

Oh, yes! Please do! Watch out, though: the current process-models checkout is on my branch for my 1.5 m/s test, so you may need to `git stash` first before switching branches.
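The stash-and-switch step Lindsay describes might look like this (demonstrated in a throwaway repo; the branch name and file are hypothetical, not the real checkout):

```shell
# Set up a scratch repo so the demo is self-contained
repo=$(mktemp -d) && cd "$repo"
git init -q -b main .
git config user.email "demo@example.com" && git config user.name "demo"
echo "baseline" > notes.txt && git add notes.txt && git commit -qm "init"
git checkout -qb velocity-test      # stand-in for the 1.5 m/s test branch
echo "wip edits" >> notes.txt       # uncommitted local changes

git stash push -m "wip: 1.5 m/s test"  # set the changes aside
git checkout -q main                   # now safe to switch branches
git checkout -q velocity-test          # come back later...
git stash pop                          # ...and restore the work in progress
grep "wip edits" notes.txt             # the edits are back
```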