DOI-USGS / lake-temperature-model-prep


Adding crosswalks for SD and IN data #333

Closed padilla410 closed 2 years ago

padilla410 commented 2 years ago

Overview

This pull request addresses the issues identified in #267: I'm adding the missing crosswalks that caused IN and SD to fall out of the pipeline. There is a lot going on in this PR because I was checking my work as I went along. Here is the general layout:

~This work resulted in 101 additional PGDL lakes and 339 GLM lakes when compared to PR #327~ This work resulted in 27 additional PGDL lakes and 0 GLM lakes when compared to PR #328

Tagging Lindsay for the review, but @jread-usgs feel free to weigh in.

closes #267

Grand Summary

I was able to successfully add crosswalks for the datasets identified in #267.

When checking my work on parsers, I rely heavily on the dat_missing table, an internal table created in the crosswalk_coop_dat() function. Before my updates, the all_missing field was TRUE for all three of these datasets. After my updates, all_missing == FALSE for each of them.

To run the code snippet below successfully, you need to place browser() before the warning() near the end of crosswalk_coop_dat() in 7a_temp_coop_munge/src/parsing_task_fxns.R

Browse[2]> target_datasets <- c('7a_temp_coop_munge/tmp/SD_Lake_temp_export.rds',
+                      '7a_temp_coop_munge/tmp/Indiana_CLP_lakedata_1994_2013.rds',
+                      '7a_temp_coop_munge/tmp/Indiana_Glacial_Lakes_WQ_IN_DNR.rds')

Browse[2]> dat_missing %>% dplyr::filter(source %in% target_datasets)
# A tibble: 3 x 3
  source                                                     all_missing sum_missing
  <chr>                                                      <lgl>             <int>
1 7a_temp_coop_munge/tmp/Indiana_CLP_lakedata_1994_2013.rds  FALSE              4221
2 7a_temp_coop_munge/tmp/Indiana_Glacial_Lakes_WQ_IN_DNR.rds FALSE              4036
3 7a_temp_coop_munge/tmp/SD_Lake_temp_export.rds             FALSE              2355
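
As a rough illustration of what these two columns mean, here is a minimal base R sketch. It assumes, hypothetically, that each parsed observation carries a site_id that is NA when no crosswalk matched; the actual logic inside crosswalk_coop_dat() may well differ. The names toy_dat and summarize_missing are made up for this example.

```r
# Toy data (hypothetical): site_id is NA when no crosswalk matched the row.
toy_dat <- data.frame(
  source  = c("a.rds", "a.rds", "b.rds", "b.rds", "b.rds"),
  site_id = c(NA, NA, "nhdhr_1", NA, "nhdhr_2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per source with the two summary columns.
# This is an assumption about the table's meaning, not the pipeline's code.
summarize_missing <- function(dat) {
  is_missing <- is.na(dat$site_id)
  data.frame(
    source      = sort(unique(dat$source)),
    all_missing = as.vector(tapply(is_missing, dat$source, all)),
    sum_missing = as.vector(tapply(is_missing, dat$source, sum)),
    stringsAsFactors = FALSE
  )
}

toy_missing <- summarize_missing(toy_dat)
print(toy_missing)
```

Under this reading, a source with all_missing == TRUE has no usable crosswalk at all, while a source with all_missing == FALSE and a nonzero sum_missing is only partially matched.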

When looking through the diff I suspect that you are going to see a number of "unexpected" rebuilds - I think most of these can be attributed to the recent factory reset on my computer. For example, I had to rebuild '7_config_merge/out/nml_list.rds.ind' in order to complete scmake('8_viz') because I did not have a local copy.

SD Crosswalk

This section includes the following verification steps for the SD data:

Verification that the scmake("1_crosswalk_fetch") created the correct sf object

This section confirms that the spatial points for the SD data set map in South Dakota.

library(mapview)

sd_points <- readRDS(sc_retrieve('1_crosswalk_fetch/out/SD_lake_pts_sf.rds.ind'))
mapview::mapview(sd_points, legend = FALSE)

The result: image

Verification that scmake("2_crosswalk_munge") worked as expected

This section confirms that the SD crosswalk finds matches in NHDHR when munged together.

> sd_nhdhr <- readRDS(sc_retrieve('2_crosswalk_munge/out/SD_nhdhr_xwalk.rds.ind'))
> head(sd_nhdhr, 5)
# A tibble: 5 x 2
  site_id         SD_ID                 
  <chr>           <chr>                 
1 nhdhr_128633827 SD_SD-BA-L-FREEMAN_01 
2 nhdhr_128617117 SD_SD-BA-L-HAYES_01   
3 nhdhr_128625721 SD_SD-BA-L-MURDO_01   
4 nhdhr_128629047 SD_SD-BA-L-WAGGONER_01
5 nhdhr_154897883 SD_SD-BF-L-NEWELL_01  
> nrow(sd_nhdhr)
[1] 120 # <- not bad! 120/129 lakes had matches in NHDHR
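
For reference, the match rate quoted in the comment above works out as follows. This is only the arithmetic on the two counts reported there (120 matched, 129 total); the variable names are illustrative.

```r
# Counts taken from the console output and comment above.
n_matched <- 120L  # rows in the SD NHDHR crosswalk
n_total   <- 129L  # SD lakes, per the "120/129" comment

match_rate  <- n_matched / n_total
n_unmatched <- n_total - n_matched

cat(sprintf("%d of %d lakes matched (%.1f%%); %d unmatched\n",
            n_matched, n_total, 100 * match_rate, n_unmatched))
```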

IN Crosswalks

A single crosswalk is used to create two NHDHR crosswalks. This complication is necessary because one IN dataset has spatial data (Indiana_CLP_lakedata_1994_2013) while the other does not (Indiana_Glacial_Lakes_WQ_IN_DNR). The two datasets also use different lake reference systems (details here).

Verification that the scmake("1_crosswalk_fetch") created the correct sf object

For the Indiana datasets, there is a little more going on under the hood so there are a few more checks.

sf object verification

This section confirms that the spatial points for the two IN data sets map in Indiana.

library(mapview)

# map verification
## IN CLP
in_clp_sf <- readRDS(sc_retrieve('1_crosswalk_fetch/out/IN_CLP_lake_pts_sf.rds.ind'))
mapview::mapview(in_clp_sf, legend = FALSE)

## IN DNR
in_dnr_sf <- readRDS(sc_retrieve('1_crosswalk_fetch/out/IN_DNR_lake_pts_sf.rds.ind'))
mapview::mapview(in_dnr_sf, legend = FALSE)

IN CLP & IN DNR map results (they're the same): image

site_id verification

This verification confirms that the same function in 1_crosswalk_fetch/src/fetch_crosswalk.R (fetch_IN_points) creates two different site_id values: one for each dataset.

> # IN CLP
> in_clp_sf <- readRDS(sc_retrieve('1_crosswalk_fetch/out/IN_CLP_lake_pts_sf.rds.ind'))
> head(in_clp_sf, 3)
Simple feature collection with 3 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -85.31443 ymin: 41.155 xmax: -84.97468 ymax: 41.23407
Geodetic CRS:  WGS 84
# A tibble: 3 x 5
  site_id   CLP_Lake_ID `Lake Name`          Lake_County_Name                       geometry
  <chr>           <int> <chr>                <chr>                               <POINT [°]>
1 IN_CLP_10          10 Cedarville Reservoir Cedarville Reservoir_Allen  (-85.0072 41.22422)
2 IN_CLP_14          14 Everett              Everett_Allen                (-85.31443 41.155)
3 IN_CLP_18          18 Hurshtown Reservoir  Hurshtown Reservoir_Allen  (-84.97468 41.23407)
> 
> # IN DNR
> in_dnr_sf <- readRDS(sc_retrieve('1_crosswalk_fetch/out/IN_DNR_lake_pts_sf.rds.ind'))
> head(in_dnr_sf, 3)
Simple feature collection with 3 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -85.31443 ymin: 41.155 xmax: -84.97468 ymax: 41.23407
Geodetic CRS:  WGS 84
# A tibble: 3 x 5
  site_id                           CLP_Lake_ID `Lake Name`          Lake_County_Name                     geometry
  <chr>                                   <int> <chr>                <chr>                             <POINT [°]>
1 IN_DNR_Cedarville Reservoir_Allen          10 Cedarville Reservoir Cedarville Reservoir_Al~  (-85.0072 41.22422)
2 IN_DNR_Everett_Allen                       14 Everett              Everett_Allen              (-85.31443 41.155)
3 IN_DNR_Hurshtown Reservoir_Allen           18 Hurshtown Reservoir  Hurshtown Reservoir_All~ (-84.97468 41.23407)
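
The two site_id patterns above can be reproduced with a small sketch. This is not the real fetch_IN_points(); make_in_site_id is a hypothetical helper that only assumes what the output shows: the CLP IDs are built from CLP_Lake_ID and the DNR IDs from Lake_County_Name.

```r
# Toy rows copied from the head() output above (geometry omitted).
in_lakes <- data.frame(
  CLP_Lake_ID      = c(10L, 14L, 18L),
  Lake_County_Name = c("Cedarville Reservoir_Allen", "Everett_Allen",
                       "Hurshtown Reservoir_Allen"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: choose the ID column based on the target dataset,
# then prefix it with "IN_<dataset>_".
make_in_site_id <- function(lakes, dataset = c("CLP", "DNR")) {
  dataset <- match.arg(dataset)
  key <- if (dataset == "CLP") lakes$CLP_Lake_ID else lakes$Lake_County_Name
  paste0("IN_", dataset, "_", key)
}

clp_ids <- make_in_site_id(in_lakes, "CLP")
dnr_ids <- make_in_site_id(in_lakes, "DNR")
print(clp_ids)
print(dnr_ids)
```

Keeping both keys on the same table is what lets one set of points serve two datasets that refer to lakes differently.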

Verification that scmake("2_crosswalk_munge") worked as expected

This section confirms that both IN crosswalks find matches in NHDHR when munged together.

> # IN CLP NHDHR crosswalk
> in_clp_nhdhr <- readRDS(sc_retrieve('2_crosswalk_munge/out/IN_CLP_nhdhr_xwalk.rds.ind'))
> head(in_clp_nhdhr, 3)
# A tibble: 3 x 5
  site_id                                    IN_CLP_ID CLP_Lake_ID `Lake Name`         Lake_County_Name         
  <chr>                                      <chr>           <int> <chr>               <chr>                    
1 nhdhr_04549854-A785-4DB7-9FC1-A3DEEBB23C49 IN_CLP_14          14 Everett             Everett_Allen            
2 nhdhr_150995189                            IN_CLP_18          18 Hurshtown Reservoir Hurshtown Reservoir_Allen
3 nhdhr_58755758                             IN_CLP_35          35 Grouse Ridge        Grouse Ridge_Bartholomew 

> # IN DNR NHDHR crosswalk
> in_dnr_nhdhr <- readRDS(sc_retrieve('2_crosswalk_munge/out/IN_DNR_nhdhr_xwalk.rds.ind'))
> head(in_dnr_nhdhr, 3)
# A tibble: 3 x 5
  site_id                                    IN_DNR_ID                  CLP_Lake_ID `Lake Name` Lake_County_Name
  <chr>                                      <chr>                            <int> <chr>       <chr>           
1 nhdhr_04549854-A785-4DB7-9FC1-A3DEEBB23C49 IN_DNR_Everett_Allen                14 Everett     Everett_Allen   
2 nhdhr_150995189                            IN_DNR_Hurshtown Reservoi~          18 Hurshtown ~ Hurshtown Reser~
3 nhdhr_58755758                             IN_DNR_Grouse Ridge_Barth~          35 Grouse Rid~ Grouse Ridge_Ba~
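
Because both crosswalks come from the same points, one further check a reviewer could run is that each CLP_Lake_ID maps to the same NHDHR site_id in both tables. The sketch below uses toy rows copied from the output above rather than the real .rds files; in_clp_toy and in_dnr_toy are illustrative names.

```r
# Toy versions of the two crosswalks, using the rows shown above.
in_clp_toy <- data.frame(
  site_id     = c("nhdhr_04549854-A785-4DB7-9FC1-A3DEEBB23C49",
                  "nhdhr_150995189", "nhdhr_58755758"),
  CLP_Lake_ID = c(14L, 18L, 35L),
  stringsAsFactors = FALSE
)
in_dnr_toy <- data.frame(
  site_id     = c("nhdhr_04549854-A785-4DB7-9FC1-A3DEEBB23C49",
                  "nhdhr_150995189", "nhdhr_58755758"),
  CLP_Lake_ID = c(14L, 18L, 35L),
  stringsAsFactors = FALSE
)

# Join on the shared lake ID and keep any rows where the NHDHR IDs disagree.
joined <- merge(in_clp_toy, in_dnr_toy, by = "CLP_Lake_ID",
                suffixes = c("_clp", "_dnr"))
mismatches <- joined[joined$site_id_clp != joined$site_id_dnr, ]

cat("lakes compared:", nrow(joined), "\n")
cat("mismatched site_ids:", nrow(mismatches), "\n")
```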

Bonus fixes and weird stuff

Results from scmake("8_viz")

~After completing the above, we pick up 101 additional PGDL lakes and 339 GLM lakes (compared to PR #327).~ This work resulted in 27 additional PGDL lakes and 0 GLM lakes when compared to PR #328

Snapshot of 8_viz/out/lakes_summary_fig.html: image

lindsayplatt commented 2 years ago

Re: how many additional lakes were added. I think we need to compare to my additions in #329, which would actually mean 27 additional PGDL lakes and no additional GLM ones? Not to downplay the impact of these xwalk fixes ...

jordansread commented 2 years ago

(don't think this one is necessary to resolve) I first ran just scmake(), but bumped into an error at 7b_temp_merge/out/source_metadata_for_release.csv.ind

Yes, this target is a dependency of the all target in remake.yml but is not among the dependencies needed to build 8_viz, so it often gets skipped because most of us have been in the habit of running scmake('8_viz'). We do use this file in data releases to attribute individual observations to their contributing org. But it isn't necessary to fix this right now and, as I've mentioned in the past, I'm not sure I understand the use cases for require_local.

lindsayplatt commented 2 years ago

@jread-usgs I created #334 to remind us of this issue.

padilla410 commented 2 years ago

Lindsay and I worked through the differences between our two branches. She is missing some local files from 7a_temp_coop_munge/tmp/ that I am not missing. That explains the difference.

We verified that this PR represents the latest/greatest version of 7a_temp_coop_munge/out/all_coop_dat_linked.feather