DOI-USGS / lake-temperature-model-prep


17 missing sites from NML list? #339

Closed · lindsayplatt closed this issue 2 years ago

lindsayplatt commented 2 years ago

Hayley pulled down the recent nml_list.rds by running scmake('7_config_merge/out/nml_list.rds', 'getters.yml') on 5/10/2022 and found that there are 17 sites that used to be in the list and are no longer there (note that 3 sites were added). I spent a bit of time this morning looking into what the issue could be, and my sleuthing identified a couple of spots with semi-recent changes that could explain the difference. The main issue may be coming from changes to 7_config_merge/out/nml_H_A_values.rds.ind (here is a link to the lines with recent changes).

Main question: should we be concerned that 17 sites are no longer included in the NML list? I would expect that our recent efforts would have only added, not subtracted.
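For context, the 17 dropped and 3 added IDs fall out of diffing the two versions of the list. A minimal sketch of that diff, assuming an older copy of the previous build is saved locally as 'nml_list_old.rds' (a hypothetical path, not a pipeline target):

# 'nml_list_old.rds' is a hypothetical local snapshot of the previous build
nml_old <- readRDS('nml_list_old.rds')
nml_new <- readRDS(scipiper::sc_retrieve('7_config_merge/out/nml_list.rds.ind'))

setdiff(names(nml_old), names(nml_new))  # the 17 dropped site_ids
setdiff(names(nml_new), names(nml_old))  # the 3 added site_ids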

library(tidyverse)
library(sf)

dropped_sites <- c("nhdhr_{09E09633-2D9C-40F7-9F89-36ED6CF2A394}", "nhdhr_{3CB73C80-7D61-484A-9B02-D881B3B366E6}",
                   "nhdhr_{503D7EB2-93D3-47F8-8613-5482F66B5C8A}", "nhdhr_{507508B4-D0CB-4DB4-A9AD-D0E45CDC6646}",
                   "nhdhr_{63D127FC-DD2D-453B-B57F-45B323A72C51}", "nhdhr_{74141881-3AF2-48DD-80A2-68A589A99B21}",
                   "nhdhr_{83CB1E97-A4A4-4F19-8527-0DCCB33EA81F}", "nhdhr_{E0B4C1E3-BF0A-485B-8BAA-16414B0FDA18}",
                   "nhdhr_107071188", "nhdhr_107071284", "nhdhr_117649999", "nhdhr_120019863",
                   "nhdhr_120020396", "nhdhr_126210325", "nhdhr_86443975", "nhdhr_91688597",
                   "nhdhr_91692063")

# All 17 dropped sites have areas
lake_areas <- readRDS(scipiper::sc_retrieve('7_config_merge/out/canonical_lakes_area.rds.ind'))
lake_areas %>% filter(site_id %in% dropped_sites)

# All 17 dropped sites have elevations
lake_elev <- arrow::read_feather(scipiper::sc_retrieve("1_crosswalk_fetch/out/ned_centroid_elevations.feather.ind"))
lake_elev %>% filter(site_id %in% dropped_sites)

# Only 6 of the 17 appear in our resulting H_A list (11 dropped)
# The only recent changes to this target are here: https://github.com/USGS-R/lake-temperature-model-prep/blame/main/7_config_merge.yml#L48-L60
nml_ha <- readRDS(scipiper::sc_retrieve("7_config_merge/out/nml_H_A_values.rds.ind"))
sum(dropped_sites %in% names(nml_ha))

# 6 of the 17 sites continue for depths
nml_depth <- readRDS(scipiper::sc_retrieve("7_config_merge/out/nml_lake_depth_values.rds.ind"))
nml_depth %>% filter(site_id %in% dropped_sites)

# 6 of the 17 continue for layers
nml_layer <- readRDS(scipiper::sc_retrieve("7_config_merge/out/nml_layer_thick_values.rds.ind"))
nml_layer %>% filter(site_id %in% dropped_sites)

# 6 of the 17 sites continue for meteo files
nml_meteo <- readRDS(scipiper::sc_retrieve("7_config_merge/out/nml_meteo_fl_values.rds.ind"))
nml_meteo %>% filter(site_id %in% dropped_sites)

# 11 of the 17 sites have Kw values (the complement of the 6 that continued above)
nml_kw <- readRDS(scipiper::sc_retrieve("7_config_merge/out/nml_Kw_values.rds.ind"))
nml_kw %>% filter(site_id %in% dropped_sites)

# None of the 17 sites appear in the final output. That is expected: the final list is
# effectively an inner join of the Kw sites (11 of the 17) with the other parameters
# (6 of the 17), and those two subsets do not overlap.
nml_list <- readRDS(scipiper::sc_retrieve("7_config_merge/out/nml_list.rds.ind"))
sum(dropped_sites %in% names(nml_list))
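As a quick sanity check on that last point, a minimal sketch (reusing the objects loaded above, and assuming as in those snippets that nml_ha is a named list and nml_kw is a table with a site_id column) confirming that the two subsets of dropped sites are disjoint:

# Reusing nml_ha, nml_kw, and dropped_sites from above
ha_dropped <- dropped_sites[dropped_sites %in% names(nml_ha)]   # the 6 that survive to H_A
kw_dropped <- dropped_sites[dropped_sites %in% nml_kw$site_id]  # the 11 with Kw values
intersect(ha_dropped, kw_dropped)      # expect character(0) per the counts above
length(union(ha_dropped, kw_dropped))  # expect 17 if those counts hold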
lindsayplatt commented 2 years ago

I'd like to draw @jread-usgs and @padilla410 's attention to this issue.

hcorson-dosch-usgs commented 2 years ago

Thanks, @lindsayplatt, for documenting this. Just wanted to note that this comparison only includes Minnesota lakes.

jordansread commented 2 years ago

Is this 17 lakes out of ~3k+? Or 17 out of the 881 TOHA lakes in MN?

@lindsayplatt do you think we lost these when going to LAGOS US (from LAGOS NE)? Alternatively, there is a chance we lost some in the poly-to-poly matching with the changes to PROJ/sf that we've noted elsewhere. But I see that all 17 of these have matches to MNDOWs, so a crosswalking issue seems unlikely.
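A quick way to double-check that last point; this is only a sketch, and both the crosswalk target name ('2_crosswalk_munge/out/mndow_nhdhr_xwalk.rds.ind') and its site_id column are assumptions that may not match the pipeline's actual names:

# Hypothetical target name and columns -- adjust to the pipeline's actual MNDOW crosswalk
mndow_xwalk <- readRDS(scipiper::sc_retrieve('2_crosswalk_munge/out/mndow_nhdhr_xwalk.rds.ind'))
mndow_xwalk %>%
  filter(site_id %in% dropped_sites) %>%
  distinct(site_id) %>%
  nrow()  # expect 17 if all of the dropped sites have MNDOW matches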

hcorson-dosch-usgs commented 2 years ago

Previously, when I subset the 14,678 MN lakes in our pipeline ('2_crosswalk_munge/out/centroid_lakes_sf.rds', filtered to MN lakes using '2_crosswalk_munge/out/lake_to_state_xwalk.rds') down to those that are in the nml list ('7_config_merge/out/nml_list.rds'), I ended up with 3,583 lakes. I now end up with 3,569 lakes. Per Lindsay's notes above, there are 3 lakes in the new subset that were not in the old subset (built with an earlier version of the nml list), and 17 lakes in the old subset that are not in the new subset.
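For reference, a minimal sketch of that subsetting using the targets named above; the 'site_id' and 'state' column names in the state crosswalk are assumptions:

library(tidyverse)
library(sf)

mn_lakes <- readRDS(scipiper::sc_retrieve('2_crosswalk_munge/out/centroid_lakes_sf.rds.ind'))
state_xwalk <- readRDS(scipiper::sc_retrieve('2_crosswalk_munge/out/lake_to_state_xwalk.rds.ind'))
nml_list <- readRDS(scipiper::sc_retrieve('7_config_merge/out/nml_list.rds.ind'))

# Assumed column names ('site_id', 'state') in the state crosswalk
mn_ids <- state_xwalk %>% filter(state == 'MN') %>% pull(site_id)
mn_lakes %>% filter(site_id %in% mn_ids, site_id %in% names(nml_list)) %>% nrow()
# 3,583 with the earlier nml list; 3,569 with the current one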

jordansread commented 2 years ago
arrow::read_feather(scipiper::sc_retrieve('7b_temp_merge/out/temp_data_with_sources.feather.ind')) %>%
  filter(site_id %in% dropped_sites) %>% group_by(site_id) %>% tally() %>% arrange(desc(n))
# A tibble: 17 × 2
   site_id                                          n
   <chr>                                        <int>
 1 nhdhr_107071284                                551
 2 nhdhr_{83CB1E97-A4A4-4F19-8527-0DCCB33EA81F}   183
 3 nhdhr_91688597                                 175
 4 nhdhr_{E0B4C1E3-BF0A-485B-8BAA-16414B0FDA18}   159
 5 nhdhr_120020396                                131
 6 nhdhr_107071188                                110
 7 nhdhr_91692063                                  88
 8 nhdhr_{507508B4-D0CB-4DB4-A9AD-D0E45CDC6646}    70
 9 nhdhr_{74141881-3AF2-48DD-80A2-68A589A99B21}    45
10 nhdhr_126210325                                 33
11 nhdhr_{09E09633-2D9C-40F7-9F89-36ED6CF2A394}    23
12 nhdhr_{503D7EB2-93D3-47F8-8613-5482F66B5C8A}    14
13 nhdhr_{3CB73C80-7D61-484A-9B02-D881B3B366E6}     7
14 nhdhr_{63D127FC-DD2D-453B-B57F-45B323A72C51}     6
15 nhdhr_117649999                                  6
16 nhdhr_120019863                                  3
17 nhdhr_86443975                                   1

☝️ the number of temperature observations each of these sites has

Some more thoughts: I think perhaps the 11 sites (and probably more beyond MN) were lost with the shift from LAGOS NE to LAGOS US. We went from a poly-to-poly LAGOS match to a direct nhdHR PERM_ID match when moving to US. So it is possible that our poly match was greedier and some of those false matches have now been dropped. There also may be depths that were in LAGOS NE but were removed in the updated US version due to QA/QC issues.
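One way to probe that hypothesis would be to look the dropped sites up in the LAGOS-US depth data directly. This is only a sketch: both the target path ('4_params_munge/out/lagos_us_depths.rds.ind') and its site_id column are assumptions, not the pipeline's actual names:

# Hypothetical target name and column -- the actual LAGOS-US depth target may differ
lagos_depths <- readRDS(scipiper::sc_retrieve('4_params_munge/out/lagos_us_depths.rds.ind'))
sum(dropped_sites %in% lagos_depths$site_id)  # few matches would be consistent with depths dropped in LAGOS-US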

I am less certain why we'd lose 6 MN lakes that previously had clarity estimates but no longer do. Most of the Kw/Kd/Secchi data comes from the WQP, and each time we pull, some measurements are lost but mostly new ones are gained. Sometimes organizations remove errant or redundant data, or remove monitoring locations completely from STORET/WQP. This is a major challenge for reproducibility, but isn't something we have control over (and it is nearly always a net benefit to update the data rather than lock in a WQP pull from a few years ago to avoid this issue). WQP was last pulled about 6 months ago, so we do have the option of doing an updated pull sometime soon, before the final model runs. The downside is that some of the Kw values would change for some of the already-modeled MN lakes in the projections. Something to consider carefully (although if the GCM pipeline is careful at the parameter stage, it would only re-run lakes with updated nml params rather than all of them).
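For that last point, a minimal sketch of how changed-parameter lakes could be identified, again assuming an older snapshot of the nml list is saved locally as 'nml_list_old.rds' (hypothetical file):

nml_new <- readRDS(scipiper::sc_retrieve('7_config_merge/out/nml_list.rds.ind'))
nml_old <- readRDS('nml_list_old.rds')  # hypothetical snapshot of the prior build

shared_ids <- intersect(names(nml_old), names(nml_new))
changed <- shared_ids[!mapply(identical, nml_old[shared_ids], nml_new[shared_ids])]
length(changed)  # only these lakes would have updated nml params and need re-running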

Additionally, because the nml_lake_depth target is calculated from the nml_H_A target, and the nml_H_A target is built from a combination of hypsography and z_max values, it is likely that the 11 lakes Lindsay showed as lost above drop out at that target because they no longer have a z_max (just wanted to clarify that, because the order of the targets might suggest otherwise).
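A minimal consistency check on that dependency, reusing the nml_ha and nml_depth objects from Lindsay's snippet above (it doesn't identify the z_max source, it just confirms the same sites are missing from both the H_A and depth targets):

missing_ha    <- setdiff(dropped_sites, names(nml_ha))      # the 11 not in the H_A target
missing_depth <- setdiff(dropped_sites, nml_depth$site_id)  # the sites not in the depth target
setequal(missing_ha, missing_depth)  # TRUE is consistent with depth being derived from H_A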

lindsayplatt commented 2 years ago

Thanks for these explanations, @jread-usgs. So what I am hearing is that these losses are likely not due to an error on our part in any of the recent PRs, but are instead an artifact of now using LAGOS-US (which does seem reasonable). I am personally OK with not investigating this issue further, treating our new set as the best available, and moving forward. I would also like to skip a new clarity pull since we have a lot of other things going on (even though it may be fairly simple).