Update: 1,508 waterbodies, not 1,510, based on the following snippet:
library(scipiper)
library(arrow)
library(tidyverse)

x <- read_feather(sc_retrieve('7a_temp_coop_munge/out/all_coop_dat_linked.feather.ind'))
x %>%
  filter(source == '7a_temp_coop_munge/tmp/Waterbody_Temperatures_by_State.rds') %>%
  pull(site_id) %>%
  unique() %>%
  length()
Attempting to dig through this in chunks.
The two datasets that are not identical are long_term_ntl.rds and ntl29_v10_0.rds ... I assume the discrepancy in the table above is due to the NTL download issues described in issue #248 and resolved in PR #249.
Correct! In addition to fixing the "always stale" issues, #249 deleted the previous version of this NTL data (called long_term_ntl.rds) from 6_temp_coop_fetch/in and replaced it with the newest available data from North Temperate Lakes LTER (called ntl29_v10_0.csv). The newer one has more data, so that checks out to me 👍
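If you still have the pre-#249 file locally, a quick row-count comparison confirms the newer release is larger (a minimal sketch, assuming local copies of both files and that each .rds holds a single data frame):

old_ntl <- readRDS('long_term_ntl.rds')  # pre-#249 version (hypothetical local copy)
new_ntl <- readRDS('ntl29_v10_0.rds')    # post-#249 version (hypothetical local copy)
nrow(new_ntl) - nrow(old_ntl)            # expect a positive number of added rows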
Checking on the counts next and then will scan through all the code!
I am struggling to get the right combo of filters to narrow down which 3 lakes those are, but I am comfortable with the fact that you aren't seeing a huge uptick in the count of waterbodies that are model-able. This Navico data largely replaced other water temperature data that we already had (presumably it is of higher quality). In 7b_temp_merge.yml, all of the temp data is combined into one giant feather file, 7b_temp_merge/out/merged_temp_data_alltimes.feather.ind, using merge_temp_data(). THEN, the dataset is resolved into one feather file, 7b_temp_merge/out/merged_temp_data_daily.feather.ind, with a single observation for each lake, date, and depth using reduce_temp_data().
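Conceptually, the reduce step does something like the following (a sketch only; the actual record-preference rules inside reduce_temp_data() may differ):

library(dplyr)

# Collapse the merged all-times data to one observation per lake/date/depth
reduce_sketch <- function(merged_temp_data) {
  merged_temp_data %>%
    group_by(site_id, date, depth) %>%
    slice(1) %>%   # keep a single record per group (real preference logic may differ)
    ungroup()
}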
If we compare the before/after of reducing the giant temp data into one observation per lake/date/depth, we can start to see how the Navico data replaces a lot of data rather than adds more to our pipeline. So this is largely an increase in quality, not quantity.
Here is how I started to explore this:
library(scipiper)
library(arrow)
library(tidyverse)
options(scipiper.dry_put = TRUE)
source_navico <- '7a_temp_coop_munge/tmp/Waterbody_Temperatures_by_State.rds'
# The merged all-times data contains the Navico data (17,132 rows)
all_temp_data <- read_feather(sc_retrieve("7b_temp_merge/out/merged_temp_data_alltimes.feather.ind"))
all_temp_navico <- all_temp_data %>% filter(source == source_navico)
nrow(all_temp_navico)
# After the target that resolves multiple unique obs per lake/date/depth, there are 8,048 rows
# of Navico data in the final dataset. This confirms that it is, in fact, getting added, but my
# assumption is that it is replacing other data, not adding new data
resolved_temp_data <- read_feather(sc_retrieve("7b_temp_merge/out/merged_temp_data_daily.feather.ind"))
resolved_temp_navico <- resolved_temp_data %>% filter(source == source_navico)
nrow(resolved_temp_navico)
# Keep only the Navico sites & days/depths that overlap with our recently added data
navico_overlap <- resolved_temp_navico %>% select(site_id, date, depth)
all_temp_shared_navico <- left_join(navico_overlap, all_temp_data, by = c('site_id', 'date', 'depth'))
# Here is the data that was replaced by the Navico data
overlap_summary <- all_temp_shared_navico %>%
group_by(site_id, date, depth) %>%
mutate(count = n()) %>%
ungroup() %>%
filter(count > 1)
head(overlap_summary)
You can see in the overlap summary how a single lake/date/depth combo can have multiple sources (the first two rows show both Navico AND another source of data, but there could be more).
# A tibble: 6 x 11
site_id date depth time timezone temp source_id source_site_id source month count
<chr> <date> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <int>
1 nhdhr_{0BB28A37-D66~ 2020-05-29 0 14:15 EST 14.9 LRBOI_WQX-LM~ LRBOI_WQX-LMich wqp_LRBOI_WQX-LMich 5 2
2 nhdhr_{0BB28A37-D66~ 2020-05-29 0 14:00 NA 16.5 Navico NA 7a_temp_coop_munge/tmp~ 5 2
3 nhdhr_120018754 2019-07-14 0 13:00 CDT 26.9 MNPCA-61-007~ MNPCA-61-0072-0~ wqp_MNPCA-61-0072-00-2~ 7 2
4 nhdhr_120018754 2019-07-14 0 15:00 NA 27.4 Navico NA 7a_temp_coop_munge/tmp~ 7 2
5 nhdhr_120018981 2019-09-15 0 09:30 CDT 18.5 MNPCA-11-020~ MNPCA-11-0203-0~ wqp_MNPCA-11-0203-01-2~ 9 6
6 nhdhr_120018981 2019-09-15 0 18:00 CDT 16.7 MNPCA-11-020~ MNPCA-11-0203-0~ wqp_MNPCA-11-0203-01-2~ 9 6
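To separate true replacements from multiple same-day readings by Navico alone, one could count the overlapping combos that involve more than one distinct source (a sketch building on the objects defined above):

# Combos where Navico coexists with at least one other source
all_temp_shared_navico %>%
  group_by(site_id, date, depth) %>%
  summarise(n_sources = n_distinct(source), .groups = 'drop') %>%
  filter(n_sources > 1) %>%
  nrow()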
Note that while the Navico data has 17k+ observations, many of them were for the same lake/date/depth but at different times of the day. We are resolving into daily observations, so we lose a bunch of quantity in this way, too. For example:
overlap_summary %>% filter(count == 14)
# A tibble: 14 x 11
site_id date depth time timezone temp source_id source_site_id source month count
<chr> <date> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <int>
1 nhdhr_74926427 2018-05-20 0 00:00 NA 16.8 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
2 nhdhr_74926427 2018-05-20 0 01:00 NA 15.8 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
3 nhdhr_74926427 2018-05-20 0 02:00 NA 15.9 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
4 nhdhr_74926427 2018-05-20 0 03:00 NA 16.0 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
5 nhdhr_74926427 2018-05-20 0 13:00 NA 14.7 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
6 nhdhr_74926427 2018-05-20 0 14:00 NA 15.3 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
7 nhdhr_74926427 2018-05-20 0 15:00 NA 15.5 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
8 nhdhr_74926427 2018-05-20 0 16:00 NA 16.3 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
9 nhdhr_74926427 2018-05-20 0 17:00 NA 17.1 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
10 nhdhr_74926427 2018-05-20 0 18:00 NA 18.0 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
11 nhdhr_74926427 2018-05-20 0 19:00 NA 17.9 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
12 nhdhr_74926427 2018-05-20 0 21:00 NA 20.8 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
13 nhdhr_74926427 2018-05-20 0 22:00 NA 18.0 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
14 nhdhr_74926427 2018-05-20 0 23:00 NA 18.9 Navico_915714 NA 7a_temp_coop_munge/tmp/Waterb~ 5 14
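As a check, those 14 hourly readings should collapse to a single record in the resolved daily data (using the objects defined above):

resolved_temp_data %>%
  filter(site_id == 'nhdhr_74926427', date == as.Date('2018-05-20'), depth == 0) %>%
  nrow()  # expect 1 row after reduce_temp_data()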
@lindsayplatt thank you for the detailed review! I was unaware that this dataset was intended to replace a lot of the data currently in the pipeline.
options(scipiper.dry_put = TRUE)
Thank you for this example! We'd been having conversations around how to use dry_put, but I did not understand how to implement it. A lot of the troubleshooting and review I did was in a small, offline Rproj to avoid triggering an update of the data on Google Drive.
@lindsayplatt thank you for the detailed review! I was unaware that this dataset was intended to replace a lot of the data currently in the pipeline.
I am not sure if we knew how much would be new vs. replacing other data when we started this, but maybe I am out of the loop on Navico. I remember not seeing much of a difference when the crosswalk was added here.
weighing in late here:
I assume the discrepancy in the table above is due to the NTL download issues described in issue #248 and resolved in PR #249.
Yes, this is because the NTL dataset was replaced with a newer one, and since the program continues to collect data, we've added more rows.
That being said, when comparing the HTML screenshot below to the other most recent HTML screenshot (from PR #239), it looks like we only gain 3 waterbodies (see PGDL Lake (n = xxx)).
I think we're gaining more than 3 here. You are right that we're gaining three new PGDL lakes, but the requirement for a PGDL lake in this map is to have 10 or more unique dates with temperature profiles (which means at least 5 depth temperature measurements, or 1 for every two meters on average, whichever number is lower). Since this Navico dataset is only surface temperatures, adding anywhere from 0 to a handful of lakes in the PGDL category seems like what we'd expect, since we're not adding new profiles.
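That profile rule could be expressed roughly like this (a minimal sketch of the criterion as described above, using max observed depth as a stand-in for lake depth; the pipeline's actual implementation may differ):

library(dplyr)

# A lake qualifies if it has >= 10 dates whose profiles have enough distinct
# depths: at least min(5, lake_depth / 2), i.e. 1 per 2 m, capped at 5
is_pgdl_candidate <- function(lake_obs, min_profile_dates = 10) {
  lake_depth <- max(lake_obs$depth)
  depths_needed <- max(1, min(5, lake_depth / 2))  # floor of 1 keeps the test non-trivial
  n_profiles <- lake_obs %>%
    group_by(date) %>%
    summarise(n_depths = n_distinct(depth), .groups = 'drop') %>%
    filter(n_depths >= depths_needed) %>%
    nrow()
  n_profiles >= min_profile_dates
}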
Perhaps if we compare the total number of lakes with any observations to the new total, we'd get a better sense of what this new dataset adds for coverage. I discussed this with @lindsayplatt yesterday, but I'd be surprised if most of the Navico data overlapped with existing data we have (same lake, depth, and date): these Navico data are recreational sources from fishing trips (probably biased toward weekends), while the majority of other data we have is agency monitoring data (probably biased toward weekdays), so I wouldn't expect a high percentage of overlaps.
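That weekday/weekend hypothesis is easy to probe (a sketch building on the objects from the snippet above; lubridate::wday() codes Sunday as 1 and Saturday as 7 by default):

resolved_temp_data %>%
  mutate(
    is_weekend = lubridate::wday(date) %in% c(1, 7),
    is_navico = source == source_navico
  ) %>%
  count(is_navico, is_weekend)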
This pull request is the second half of PR #247 and successfully runs the following components of the pipeline:

scmake("7a_temp_coop_munge")
scmake("7b_temp_merge")
scmake("8_viz") (<-- recognizing that this step alone would result in a complete build)

The main work here is in 7a_temp-coop_munge/src/crosswalk_fxns.R, and a few tweaks in 7a_temp-coop_munge/src/data_parsers/parse_navico_data.R to get the pipeline to recognize the Navico data crosswalk in 2_crosswalk_munge/out/.
Completing the pipeline run results in 17,167 records in 7a_coop_munge/out/all_coop_dat_linked.feather. All other datasets, except two, appear to be identical (based on counts). The two datasets that are not identical are long_term_ntl.rds and ntl29_v10_0.rds:

(Note: in this case "n_old" refers to all_coop_dat_linked.feather from 2021-12-08 and "n_new" refers to all_coop_dat_linked.feather from 2021-12-17.)

I assume the discrepancy in the table above is due to the NTL download issues described in issue #248 and resolved in PR #249.
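A per-source comparison like the one in that table could be reproduced along these lines (a sketch; the snapshot file names are hypothetical local copies of the two builds, and arrow/dplyr are assumed loaded):

old <- read_feather('all_coop_dat_linked_2021-12-08.feather') %>%
  count(source, name = 'n_old')
new <- read_feather('all_coop_dat_linked_2021-12-17.feather') %>%
  count(source, name = 'n_new')
full_join(old, new, by = 'source') %>%
  filter(n_old != n_new | is.na(n_old) | is.na(n_new))  # sources whose counts changed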
When comparing all_coop_dat_linked.feather from 2021-12-08 to the same file from 2021-12-17, there are 895 new waterbodies that have corresponding NHD IDs. This seems reasonable based on the fact that there were 2,786 waterbodies in the original dataset and 1,510 with corresponding NHD IDs. That being said, when comparing the HTML screenshot below to the other most recent HTML screenshot (from PR #239), it looks like we only gain 3 waterbodies (see PGDL Lake (n = xxx)).

Closes #219
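The 895-waterbody figure could be reproduced by diffing the unique linked site IDs across the two snapshots (a sketch; file names are hypothetical local copies):

old_ids <- unique(read_feather('all_coop_dat_linked_2021-12-08.feather')$site_id)
new_ids <- unique(read_feather('all_coop_dat_linked_2021-12-17.feather')$site_id)
length(setdiff(new_ids, old_ids))  # waterbodies newly linked to NHD IDs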
Screenshot of updated map coverage (8_viz/out/lakes_summary_fig.html):