Closed: padilla410 closed this issue 2 years ago
~One thing that would be nice to have is a lake-to-state crosswalk. I think we could use that for the summary tables/figures mentioned above, but also it would be immediately useful to me in the GCM work. I need to get only MN lakes :)~
This was completed in #278
Great capture of a challenging part of this pipeline. I think lakes end up needing to be matched/crosswalked from four different types of sources.
For your problem 1, I think we have a mix of things that currently work pretty well (I think the polygon process is decent, and I think the WQP crosswalks make sense), are overdone/laborious (e.g., creating a new "Points" xwalk for a dataset of ~15 lakes seems excessive, and like Julie is mentioning, these tiny crosswalks add up), or are ignored/missing (this is probably the case for many of the "Points" datasets and probably all of the "Other" datasets). Lastly, a single site with a known NHDHR ID shouldn't need a separate file with this mapping - hopefully we'd be able to support coding that in directly as a variable in the parser file.
For your solution 1, there seems to be a difference between ad hoc IDs or information we'd be using to link the lake vs. actual IDs that the data contributor is using. If actual IDs are being used (LAGOS IDs, WBICs, and DOWs are common examples), I agree we'd be doing a better service by exporting these linkages in a crosswalk that is part of the dataset. As an aside, I'd suggest that the crosswalk include the GNIS name, the state, and the county of the lake (see #278 for an element of this). Sometimes a provider has a naming scheme that is unique but probably internal, and they don't have that many lakes, so exporting such an xwalk in a data release might be confusing and mostly empty. Lastly, some ways we're linking lakes may not need a persistent target/file with this matching (such as point in polygon, with the exception of WQP, which I think needs this matching because there can be so many unique sites within a lake). I'd like to think we can do this lower-cost matching on the fly when we're merging things in your problem 2 and avoid a persistent xwalk for those altogether?
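To make the on-the-fly idea concrete, here is a minimal sketch of a point-in-polygon match done at merge time rather than via a persistent xwalk file. The object and column names (`coop_sites`, `nhdhr_lakes_sf`, `latitude`/`longitude`, `site_id`) are placeholders, not the pipeline's actual targets.

```r
# Sketch only: match cooperator sites to NHDHR lake polygons at merge time.
# All object/column names here are assumptions for illustration.
library(sf)
library(dplyr)

match_points_to_lakes <- function(coop_sites, nhdhr_lakes_sf) {
  coop_sites %>%
    st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%  # make sites spatial
    st_transform(st_crs(nhdhr_lakes_sf)) %>%                       # use the lake layer's CRS
    st_join(nhdhr_lakes_sf["site_id"], join = st_within) %>%       # point-in-polygon match
    st_drop_geometry()                                             # back to a plain table
}
```

Sites that fall outside every polygon would come back with `NA` in `site_id`, which keeps a downstream `!is.na()` filter pattern intact.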
A larger crosswalk (one that also includes state, county, and GNIS name) with the major contributors or semi-official IDs is exported in data releases. We have some decisions to make regarding what belongs in this export.
For your problem 2: yes, I think the current pattern of `crosswalk_coop_dat` is not built to scale elegantly and handle an increasing number of unique crosswalks. I see three problems: 1) the function doesn't scale well with new sources (lotsa copy+paste and adding new arguments), 2) as you mentioned, it is hard to know what you changed or to track down how and if your new data were added, and 3) the function as designed doesn't support adding sources without an explicit NHDHR crosswalk, which, as discussed above, is overkill for most small data sources; including that many unique crosswalks is a lot of work and would create a lot of small new targets.
I'd suggest breaking this problem out into two parts: one for the desired intermediate summary targets (what do those look like? at what stage(s) of the pipeline do they exist? how are they versioned and tracked?) and a second for the proposed overhaul of what `crosswalk_coop_dat` does.
One way to handle the function is to modify the upstream data format so that each coop parser actually handles its own NHDHR matching, or to have the function add a column or something to the output that specifies which function or file the downstream matching target should use (e.g., specifying a certain function name means the function will expect certain columns to exist, such as lat/lon, and specifying a crosswalk file means a simple join needs to be used). Then `crosswalk_coop_dat` would basically be a bind, an `!is.na()` filter, and a de-duplication and summary function that wouldn't require you to add new code sections when new file types were included.
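A minimal sketch of that bind/filter/de-duplicate version (not the project's actual implementation; `site_id` and `source` are placeholder column names, assuming the parsers or an upstream step have already attached the NHDHR match):

```r
library(dplyr)

# Sketch: each element of coop_dat_list is one parser's output, already
# carrying a `site_id` (NHDHR ID) column and a `source` column (assumed names).
crosswalk_coop_dat <- function(coop_dat_list) {
  bind_rows(coop_dat_list) %>%     # bind every parser's output together
    filter(!is.na(site_id)) %>%    # drop records that never matched an NHDHR lake
    distinct()                     # de-duplicate repeated records across sources
}

# Companion summary so it is easy to see what each source contributed
summarize_coop_dat <- function(coop_dat) {
  count(coop_dat, source, site_id, name = "n_records")
}
```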
Lake-to-state xwalk is complete, see #278.
```r
xwalk <- readRDS(sc_retrieve('2_crosswalk_munge/out/lake_to_state_xwalk.rds.ind'))
```
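For example (assuming the xwalk carries a `state` column; check the actual column names and value coding in the target), the MN-only subset mentioned above would then be a one-line filter:

```r
library(dplyr)
mn_lakes <- filter(xwalk, state == "MN")  # `state` column and "MN" coding are assumptions
```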
Over the past few months, a number of small issues have come up while working in the pipeline that have made debugging difficult and will ultimately result in code that is difficult to maintain.
This issue is a work in progress. It represents my current understanding of what we would like to refactor based on a few meetings from 2022-01-04. Tagged assignees should edit this issue based on their current understanding of problems with the pipeline (or feel free to make corrections to my current understanding!). Each problem definition should include why a portion of the pipeline needs refactoring, notes on how to refactor (if you've given it some thought), and desired outcomes from the refactor (simplified code? an output table? etc.). The Lakes Team will then have a meeting to see if we are all on the same page and to discuss a timeline for completion. After that meeting, I will update/finalize this issue and start the refactoring process.
Issues
Refactor crosswalks from `1_crosswalk_fetch/in` that will be used in `7a_temp_coop_munge`
Problem: Using the current pipeline configuration, each unique dataset munged in `7a_temp_coop_munge` needs a crosswalk in `1_crosswalk_fetch/in`. This results in an ever-expanding number of arguments being passed into `crosswalk_coop_dat`, and many of these crosswalks are excessive for the data they cover (i.e., does the citizen science data set from Bull Shoals really need its own crosswalk when it only contains one lake?). Normalizing crosswalks across datasets may make it possible to cut down on arguments to `crosswalk_coop_dat`, minimize duplicate work, and produce a crosswalk that links cooperator data to `nhdhr` IDs, which is a known desired output.

Key Outcome: `nhdhr` id values that can be returned to cooperators as a data release
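One possible shape for that exported crosswalk, sketched here only to make the discussion concrete (long format, so a new ID type doesn't require a new column; the column names and example rows are made up):

```r
library(tibble)

# Made-up placeholder rows; one row per cooperator identifier per lake
coop_lake_xwalk <- tribble(
  ~site_id,        ~id_type, ~id_value,    ~GNIS_Name,     ~state, ~county,
  "nhdhr_0000001", "DOW",    "00-0000-00", "Example Lake", "MN",   "Example County",
  "nhdhr_0000002", "WBIC",   "000000",     "Sample Lake",  "WI",   "Sample County"
)
```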
Understanding the results of `crosswalk_coop_dat` in `7a_temp_coop_munge` is difficult because there are no intermediate summary targets

Problem: Currently, understanding exactly what happens in `crosswalk_coop_dat` is difficult. Sometimes a data set contains new lakes, while other times a data set may be adding data for existing lakes. There are a few potential benefits to adding intermediary summaries.
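As a loose illustration (not a settled design) of what one intermediary summary target could report, assuming the linked data carry `source` and `site_id` columns:

```r
library(dplyr)

# Sketch: for each source, count records, matched lakes, and lakes that are
# new relative to the previously linked data. Column names are assumptions.
summarize_linking <- function(new_dat, existing_dat) {
  new_dat %>%
    group_by(source) %>%
    summarize(
      n_records   = n(),
      n_lakes     = n_distinct(site_id),
      n_new_lakes = n_distinct(setdiff(site_id, existing_dat$site_id)),
      .groups = "drop"
    )
}
```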
Key Outcome: a summary of `7a_temp_coop_munge/all_coop_dat_linked.feather` by `nhdhr` id

Other Key Questions
Key Areas of the Pipeline to Refactor