DOI-USGS / lake-temperature-model-prep


Adding Navico data and parser #263

Closed · padilla410 closed 2 years ago

padilla410 commented 2 years ago

This pull request is the second half of PR #247 and successfully runs the following components of the pipeline (sketched as an R session below):

  1. Fetch data (`scmake("6_temp_coop_fetch")`)
  2. Munge data into the correct format (`scmake("7a_temp_coop_munge")`)
  3. Combine into one dataset (`scmake("7b_temp_merge")`)
  4. Plot to evaluate data changes (`scmake("8_viz")`) (recognizing that this step alone would result in a complete build)
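
For reference, a minimal sketch of that build sequence as an R session (assuming scipiper is installed and the working directory is the repo root):

library(scipiper)

# Build each pipeline phase in order; scmake() only rebuilds stale targets
scmake("6_temp_coop_fetch")   # 1. fetch the coop data
scmake("7a_temp_coop_munge")  # 2. munge into the shared format
scmake("7b_temp_merge")       # 3. combine into one dataset
scmake("8_viz")               # 4. plot to evaluate changes (implies a complete build)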

The main work here is in `7a_temp_coop_munge/src/crosswalk_fxns.R`, plus a few tweaks in `7a_temp_coop_munge/src/data_parsers/parse_navico_data.R` to get the pipeline to recognize the Navico data crosswalk in `2_crosswalk_munge/out/`.

Completing the pipeline run results in 17,167 records in `7a_temp_coop_munge/out/all_coop_dat_linked.feather`. All other datasets, except two, appear to be identical (based on counts). The two datasets that are not identical are `long_term_ntl.rds` and `ntl29_v10_0.rds`:

[screenshot: table comparing record counts by dataset]

_(Note: in this case "n_old" refers to all_coop_dat_linked.feather from 2021-12-08 and "n_new" refers to all_coop_dat_linked.feather from 2021-12-17)_

I assume the discrepancy in the table above is due to the NTL download issues described in issue #248 and resolved in PR #249.

When comparing `all_coop_dat_linked.feather` from 2021-12-08 to the same file from 2021-12-17, there are 895 new waterbodies that have corresponding NHD IDs. This seems reasonable given that there were 2,786 waterbodies in the original dataset and 1,510 with corresponding NHD IDs. That being said, when comparing the HTML screenshot below to the most recent previous HTML screenshot (from PR #239), it looks like we only gain 3 waterbodies (see PGDL Lake (n = xxx)).
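
A sketch of one way to compare the two snapshots, assuming local copies of the old and new feather files (the file names below are hypothetical, not pipeline targets):

library(arrow)

# Hypothetical local copies of the 2021-12-08 and 2021-12-17 builds
old <- read_feather('all_coop_dat_linked_2021-12-08.feather')
new <- read_feather('all_coop_dat_linked_2021-12-17.feather')

# Waterbodies (NHD site IDs) present in the new build but not the old one
length(setdiff(unique(new$site_id), unique(old$site_id)))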

Closes #219

Screenshot of updated map coverage (`8_viz/out/lakes_summary_fig.html`):

[screenshot: lake coverage map]

padilla410 commented 2 years ago

Update: 1,508 waterbodies, not 1,510, based on the following snippet:

library(scipiper)
library(arrow)
library(dplyr)

# Count unique site IDs contributed by the Navico source file
x <- read_feather(sc_retrieve('7a_temp_coop_munge/out/all_coop_dat_linked.feather.ind'))

x %>% 
  filter(source == '7a_temp_coop_munge/tmp/Waterbody_Temperatures_by_State.rds') %>%
  pull(site_id) %>% 
  unique() %>% 
  length()
lindsayplatt commented 2 years ago

Attempting to dig through this in chunks.

The two datasets that are not identical are long_term_ntl.rds and ntl29_v10_0.rds ... I assume the discrepancy in the table above is due to the NTL download issues described in issue #248 and resolved in PR #249.

Correct! In addition to fixing the "always stale" issues, #249 deleted the previous version of this NTL data (called `long_term_ntl.rds`) from `6_temp_coop_fetch/in` and replaced it with the newest available data from North Temperate Lakes LTER (called `ntl29_v10_0.csv`). The newer one has more data, so that checks out to me 👍

Checking on the counts next and then will scan through all the code!

lindsayplatt commented 2 years ago

I am struggling to get the right combination of filters to narrow down which 3 lakes those are, but I am comfortable with the fact that you aren't seeing a huge uptick in the count of waterbodies that are model-able. This Navico data largely replaced other water temperature data that we already had (presumably it is of higher quality). In `7b_temp_merge.yml`, all of the temp data is combined into one giant feather file, `7b_temp_merge/out/merged_temp_data_alltimes.feather.ind`, using `merge_temp_data()`. THEN, the dataset is resolved into one feather file, `7b_temp_merge/out/merged_temp_data_daily.feather.ind`, with a single observation for each lake, date, and depth using `reduce_temp_data()`.

If we compare the before/after of reducing the giant temp data into one observation per lake/date/depth, we can start to see how the Navico data replaces a lot of data rather than adding more to our pipeline. So this is largely an increase in quality, not quantity.
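
For intuition, a rough sketch of the kind of reduction `reduce_temp_data()` performs (this is hypothetical, not the actual function, which presumably applies smarter source-priority rules):

library(dplyr)

# Hypothetical stand-in for reduce_temp_data(): keep one observation per
# lake/date/depth combo (the real function likely ranks competing sources)
reduce_sketch <- function(temp_data) {
  temp_data %>%
    group_by(site_id, date, depth) %>%
    slice(1) %>%
    ungroup()
}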

Here is how I started to explore this:

library(scipiper)
library(arrow)
library(tidyverse)
options(scipiper.dry_put = TRUE)

source_navico <- '7a_temp_coop_munge/tmp/Waterbody_Temperatures_by_State.rds'

# The merged all-times data includes the Navico records (17,132 rows)
all_temp_data <- read_feather(sc_retrieve("7b_temp_merge/out/merged_temp_data_alltimes.feather.ind"))
all_temp_navico <- all_temp_data %>% filter(source == source_navico)
nrow(all_temp_navico)

# After the target that resolves multiple unique obs per lake/date/depth, there are 8,048 rows of Navico data
# in the final dataset. This confirms that it is, in fact, getting added but my assumption is that it
# is replacing other data, not adding new data
resolved_temp_data <- read_feather(sc_retrieve("7b_temp_merge/out/merged_temp_data_daily.feather.ind"))
resolved_temp_navico <- resolved_temp_data %>% filter(source == source_navico)
nrow(resolved_temp_navico)

# Keep only the Navico sites & days/depths that overlap with our recently added data
navico_overlap <- resolved_temp_navico %>% select(site_id, date, depth)
all_temp_shared_navico <- left_join(navico_overlap, all_temp_data, by = c("site_id", "date", "depth"))

# Here is the data that was replaced by the Navico data
overlap_summary <- all_temp_shared_navico %>% 
  group_by(site_id, date, depth) %>% 
  mutate(count = n()) %>% 
  ungroup() %>% 
  filter(count > 1)

head(overlap_summary)

You can see in the overlap summary how a single lake/date/depth combo can have multiple sources (the first two rows show both Navico AND another source of data, but there could be more).

# A tibble: 6 x 11
  site_id              date       depth time  timezone  temp source_id     source_site_id   source                  month count
  <chr>                <date>     <dbl> <chr> <chr>    <dbl> <chr>         <chr>            <chr>                   <dbl> <int>
1 nhdhr_{0BB28A37-D66~ 2020-05-29     0 14:15 EST       14.9 LRBOI_WQX-LM~ LRBOI_WQX-LMich  wqp_LRBOI_WQX-LMich         5     2
2 nhdhr_{0BB28A37-D66~ 2020-05-29     0 14:00 NA        16.5 Navico        NA               7a_temp_coop_munge/tmp~     5     2
3 nhdhr_120018754      2019-07-14     0 13:00 CDT       26.9 MNPCA-61-007~ MNPCA-61-0072-0~ wqp_MNPCA-61-0072-00-2~     7     2
4 nhdhr_120018754      2019-07-14     0 15:00 NA        27.4 Navico        NA               7a_temp_coop_munge/tmp~     7     2
5 nhdhr_120018981      2019-09-15     0 09:30 CDT       18.5 MNPCA-11-020~ MNPCA-11-0203-0~ wqp_MNPCA-11-0203-01-2~     9     6
6 nhdhr_120018981      2019-09-15     0 18:00 CDT       16.7 MNPCA-11-020~ MNPCA-11-0203-0~ wqp_MNPCA-11-0203-01-2~     9     6
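
To quantify that, one follow-up (using the objects from the snippet above) is to count how many distinct sources feed each resolved lake/date/depth combo:

# Distribution of distinct-source counts per lake/date/depth combo
overlap_summary %>% 
  group_by(site_id, date, depth) %>% 
  summarize(n_sources = n_distinct(source), .groups = "drop") %>% 
  count(n_sources)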

Note that while the Navico data has 17k+ observations, many of them were for the same lake/date/depth but at different times of the day. We are resolving into daily observations, so we lose a bunch of quantity in this way, too. For example:

overlap_summary %>% filter(count == 14)
# A tibble: 14 x 11
   site_id        date       depth time  timezone  temp source_id     source_site_id source                         month count
   <chr>          <date>     <dbl> <chr> <chr>    <dbl> <chr>         <chr>          <chr>                          <dbl> <int>
 1 nhdhr_74926427 2018-05-20     0 00:00 NA        16.8 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
 2 nhdhr_74926427 2018-05-20     0 01:00 NA        15.8 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
 3 nhdhr_74926427 2018-05-20     0 02:00 NA        15.9 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
 4 nhdhr_74926427 2018-05-20     0 03:00 NA        16.0 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
 5 nhdhr_74926427 2018-05-20     0 13:00 NA        14.7 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
 6 nhdhr_74926427 2018-05-20     0 14:00 NA        15.3 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
 7 nhdhr_74926427 2018-05-20     0 15:00 NA        15.5 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
 8 nhdhr_74926427 2018-05-20     0 16:00 NA        16.3 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
 9 nhdhr_74926427 2018-05-20     0 17:00 NA        17.1 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
10 nhdhr_74926427 2018-05-20     0 18:00 NA        18.0 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
11 nhdhr_74926427 2018-05-20     0 19:00 NA        17.9 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
12 nhdhr_74926427 2018-05-20     0 21:00 NA        20.8 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
13 nhdhr_74926427 2018-05-20     0 22:00 NA        18.0 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
14 nhdhr_74926427 2018-05-20     0 23:00 NA        18.9 Navico_915714 NA             7a_temp_coop_munge/tmp/Waterb~     5    14
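
A quick way to quantify that collapse across the whole Navico dataset (again using the objects from the snippet above):

# How many Navico rows disappear when sub-daily times collapse
# to one observation per lake/date/depth?
all_temp_navico %>% 
  count(site_id, date, depth, name = "n_times") %>% 
  summarize(combos = n(), rows_lost = sum(n_times - 1))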
padilla410 commented 2 years ago

@lindsayplatt thank you for the detailed review! I was unaware that this dataset was intended to replace a lot of the data currently in the pipeline.

padilla410 commented 2 years ago
options(scipiper.dry_put = TRUE)

Thank you for this example! We'd been having conversations around how to use dry_put, but I did not understand how to implement it. A lot of the troubleshooting and review I did was in a small, offline Rproj to avoid triggering an update of the data on Google Drive.
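
For anyone who finds this later, here is how I understand the pattern is meant to be used (my reading of the option, not authoritative scipiper documentation):

library(scipiper)
library(arrow)

# Assumption: with dry_put set, local builds and retrievals won't push
# updated data files back to the shared Google Drive cache
options(scipiper.dry_put = TRUE)

dat <- read_feather(sc_retrieve('7b_temp_merge/out/merged_temp_data_daily.feather.ind'))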

lindsayplatt commented 2 years ago

@lindsayplatt thank you for the detailed review! I was unaware that this dataset was intended to replace a lot of the data currently in the pipeline.

I am not sure if we knew how much would be new vs. replacing other data when we started this, but maybe I am out of the loop on Navico. I remember there not being much of a difference when the crosswalk was added here.

jordansread commented 2 years ago

Weighing in late here:

I assume the discrepancy in the table above is due to the NTL download issues described in issue #248 and resolved in PR #249.

Yes, this is because the NTL dataset was replaced with a newer one, and since the program continues to collect data, we've added more rows.

That being said, when comparing the html screenshot below to the other most recent HTML screenshot (from PR #239), it looks like we only gain 3 waterbodies (see PDGL Lake (n = xxx)).

I think we're gaining more than 3 here. You are right that we're gaining three new PGDL lakes, but the requirement for a PGDL lake in this map is to have 10 or more unique dates with temperature profiles (which means at least 5 depth temperature measurements or 1 for every two meters on average, whichever number is lower). Since this Navico dataset is only surface temperatures, adding anywhere from 0 to a handful of lakes in the PGDL category seems like what we'd expect, since we're not adding new profiles.
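
For concreteness, a hypothetical sketch of that profile rule (not the pipeline's actual implementation; `lake_depths` with a `max_depth_m` column is an assumed lookup table):

library(dplyr)

# A "profile" date needs at least min(5, max_depth_m / 2) distinct depths;
# a PGDL lake needs 10+ such dates
profile_dates <- resolved_temp_data %>% 
  left_join(lake_depths, by = "site_id") %>%   # assumed lake-depth lookup
  group_by(site_id, date) %>% 
  summarize(n_depths = n_distinct(depth),
            required = pmin(5, first(max_depth_m) / 2),
            .groups = "drop") %>% 
  filter(n_depths >= required)

pgdl_lakes <- profile_dates %>% 
  count(site_id, name = "n_profile_dates") %>% 
  filter(n_profile_dates >= 10)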

Perhaps if we compare the total number of lakes with any observations to the new total, we'd get a better sense of what this new dataset adds for coverage. I discussed this with @lindsayplatt yesterday: I'd be surprised if most of the Navico data overlapped with existing data we have (same lake, depth, and date), since these Navico data are recreational sources from fishing trips (probably biased toward weekends) while the majority of the other data we have is agency monitoring data (probably biased toward weekdays), so I wouldn't expect a high percentage of overlaps.
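
If someone wants to test the weekend/weekday hunch, a quick sketch using the objects from @lindsayplatt's snippet above:

library(dplyr)
library(lubridate)

# Compare the day-of-week mix for Navico vs. all other sources
resolved_temp_data %>% 
  mutate(is_navico = source == source_navico,
         is_weekend = wday(date, week_start = 1) >= 6) %>% 
  count(is_navico, is_weekend)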