DOI-USGS / lake-temperature-model-prep


Mega-Issue: Refactor Pipeline #271

Closed padilla410 closed 2 years ago

padilla410 commented 2 years ago

Over the past few months, a number of small issues have come up while working in the pipeline that have made debugging difficult and will ultimately result in code that is difficult to maintain.

This issue is a work in progress. It represents my current understanding of what we would like to refactor based on a few meetings from 2022-01-04. Tagged assignees should edit this issue based on their current understanding of problems with the pipeline (or feel free to make corrections to my current understanding!). Problem definitions should include why a portion of the pipeline needs refactoring, notes on how to refactor (if you've given it some thought), and desired outcomes from the refactor (simplified code? an output table? etc.). The Lakes Team will then have a meeting to see if we are all on the same page and to discuss a timeline for completion. After that meeting, I will update/finalize this issue and start the refactoring process.

Issues

Refactor crosswalks from 1_crosswalk_fetch/in that will be used in 7a_temp_coop_munge

Problem: Using the current pipeline configuration, each unique dataset munged in 7a_temp_coop_munge needs a crosswalk in 1_crosswalk_fetch/in. This results in an ever-expanding list of arguments being passed into crosswalk_coop_dat, which has two problems:

Key Outcome:

Understanding the results of crosswalk_coop_dat in 7a_temp_coop_munge is difficult because there are no intermediate summary targets

Problem: Currently, understanding exactly what happens in crosswalk_coop_dat is difficult. Sometimes a dataset contains new lakes, while other times a dataset may add data for existing lakes. There are a few potential benefits to adding these intermediary summaries:

Key Outcome:

Other Key Questions

Key Areas of the Pipeline to Refactor

lindsayplatt commented 2 years ago

~One thing that would be nice to have is a lake-to-state crosswalk. I think we could use that for the summary tables/figures mentioned above, but also it would be immediately useful to me in the GCM work. I need to get only MN lakes :)~

This was completed in #278

jordansread commented 2 years ago

Great capture of a challenging part of this pipeline. I think lakes end up needing to be matched/crosswalked from four different types of sources:

  1. Polygons. These are the most "expensive" matches we do, but are pretty straightforward. A separate shapefile with the provider IDs is matched to the NHDHR shapes.
  2. Points. Lakes or sampling locations on lakes are usually provided as lat/lon (many coop data sources and all WQP data work this way; see the point-in-polygon sketch after this list)
  3. Single site. Big dumps of data from a single focus lake come in. There isn't a specific linkage to the lake because it is implied in the file names or explicitly provided by the data contributor.
  4. Other. Lake data has some kind of alternate way of being defined as unique by the data provider. I think the most common pattern I've seen here is "name", "county", and "state", but otherwise no spatial information. These are hard to match confidently, and I don't think we've tried yet.
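
For the "Points" category, the match could be done as a lightweight point-in-polygon operation rather than a hand-built crosswalk per dataset. A minimal sketch with sf, assuming a canonical NHDHR polygon layer with a site_id column; the object names, file path, and coop columns below are hypothetical:

```r
library(sf)
library(dplyr)

# Hypothetical inputs: the canonical NHDHR lake polygons (with a site_id column) and a
# small coop site table with lat/lon columns; names and path are illustrative only
nhd_lakes <- readRDS('1_crosswalk_fetch/out/canonical_lakes_sf.rds')
coop_sites <- tibble(
  provider_id = c('LK001', 'LK002'),
  latitude = c(45.12, 46.03),
  longitude = c(-93.45, -94.21)
)

# Convert the point sites to sf and match the polygon layer's CRS
coop_sites_sf <- st_as_sf(coop_sites,
                          coords = c('longitude', 'latitude'),
                          crs = 4326) %>%
  st_transform(st_crs(nhd_lakes))

# Point-in-polygon join: each site picks up the site_id of the NHDHR lake it falls inside
points_xwalk <- st_join(coop_sites_sf, nhd_lakes, join = st_within) %>%
  st_drop_geometry() %>%
  select(provider_id, site_id)
```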

For your problem 1, I think we have a mix of things that currently work pretty well (I think the polygon process is decent, and I think the WQP crosswalks make sense), are overdone/laborious (e.g., creating a new "Points" xwalk for a dataset of ~15 lakes seems excessive, and like Julie is mentioning, these tiny crosswalks add up), or are ignored/missing (this is probably the case for many of the "Points" datasets and probably all of the "Other" datasets). Lastly, a single site with a known NHDHR ID shouldn't need a separate file with this mapping - hopefully we'd be able to support coding that directly into the parser file as a variable.
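
For the single-site case, one way to "code that directly in" is for the parser itself to attach the known NHDHR ID, so no one-row crosswalk file is needed. A hedged sketch; the parser name and ID value are made up:

```r
library(dplyr)

# Hypothetical parser for a single-lake data dump where the NHDHR ID is already known;
# attaching the ID inside the parser removes the need for a one-row crosswalk file
parse_example_lake_dump <- function(infile) {
  readr::read_csv(infile) %>%
    mutate(site_id = 'nhdhr_xxxxxxx')  # contributor-supplied NHDHR ID (placeholder value)
}
```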

For your solution 1, there seems to be a difference between ad hoc IDs or information we'd use to link the lake vs. actual IDs that the data contributor is using. If actual IDs are being used (LAGOS IDs, WBICs, and DOWs are common examples), I agree we'd be doing a better service by exporting these linkages in a crosswalk that is part of the dataset. As an aside, I'd suggest that the crosswalk include the GNIS name, the state, and the county of the lake (see #278 for an element of this). Sometimes a provider has a naming scheme that is unique but probably internal, and they don't have that many lakes, so exporting such an xwalk in a data release might be confusing and mostly empty. Lastly, some ways we're linking lakes may not need a persistent target/file with this matching (such as point in polygon, with the exception of WQP, which I think needs this matching because there can be so many unique sites within a lake). I'd like to think we can do this lower-cost matching on the fly when we're merging things in your problem 2 and avoid a persistent xwalk for those altogether?

revised suggestion for solution 2

A larger crosswalk (including state, county, and GNIS name as well) with the major contributors or semi-official IDs is exported in data releases. We have some decisions to make regarding what belongs in this export.
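
As a starting point for that discussion, one possible shape for the exported crosswalk (a sketch only; column names and placeholder values are illustrative, not decided):

```r
library(dplyr)

# Illustrative structure for the exported crosswalk; NAs where a provider ID does not apply
lake_id_xwalk <- tribble(
  ~site_id,       ~GNIS_name,     ~state, ~county,       ~LAGOS_id,    ~WBIC,     ~DOW,
  'nhdhr_<id1>',  'Example Lake', 'MN',   'Example Co.', NA,           NA,        '<DOW id>',
  'nhdhr_<id2>',  'Another Lake', 'WI',   'Example Co.', '<LAGOS id>', '<WBIC>',  NA
)
```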

For your problem 2: Yes, I think the current pattern of crosswalk_coop_dat is not built to scale elegantly and handle an increasing number of unique crosswalks. I see three problems: 1) the function doesn't scale well with new sources (lotsa copy+paste and adding new arguments), 2) as you mentioned, it is hard to know what you changed or to track down how and whether your new data were added, and 3) the function as designed doesn't support adding sources without an explicit NHDHR crosswalk, which, as discussed above, is overkill for most small data sources; including that many unique crosswalks is a lot of work and would create a lot of small new targets.

I'd suggest breaking this problem out into two parts, one for the desired intermediate summary targets (what do those look like? at what stage(s) of the pipeline do they exist? how are they versioned and tracked?) and a second for the proposed overhaul of what crosswalk_coop_dat does.

One way to handle the function is to modify the upstream data format so that each coop parser either handles its own NHDHR matching or adds a column to its output specifying which function or crosswalk file the downstream matching target should use (e.g., specifying a certain function name means that function will expect certain columns to exist, such as lat/lon, while specifying a crosswalk file means a simple join is used). Then crosswalk_coop_dat would basically be a bind, an !is.na() filter, and a de-duplication and summary function that wouldn't require adding new code sections when new file types are included.
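
A hedged sketch of what that could look like, assuming each parsed coop file declares its matching method and the needed columns; none of these helper functions or column names exist in the pipeline today:

```r
library(dplyr)

# Hypothetical dispatch: each parsed coop file declares how it should be matched
# ('has_id' = site_id already attached by the parser, 'point_in_polygon' = match
# lat/lon to NHDHR polygons, 'xwalk_join' = join against a provider crosswalk file)
match_coop_file <- function(parsed, method, xwalk = NULL, nhd_lakes = NULL) {
  switch(method,
    has_id = parsed,                                               # site_id coded in by the parser
    point_in_polygon = match_points_to_lakes(parsed, nhd_lakes),   # hypothetical sf-based helper
    xwalk_join = left_join(parsed, xwalk, by = 'provider_id')      # simple join to a provider xwalk
  )
}

# crosswalk_coop_dat then reduces to a bind + filter + de-duplication/summary,
# so adding a new file type doesn't require a new code section here
crosswalk_coop_dat <- function(matched_list) {
  bind_rows(matched_list) %>%
    filter(!is.na(site_id)) %>%
    distinct(site_id, DateTime, depth, temp, .keep_all = TRUE)
}
```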

lindsayplatt commented 2 years ago

Lake-to-state xwalk is complete, see #278.

xwalk <- readRDS(sc_retrieve('2_crosswalk_munge/out/lake_to_state_xwalk.rds.ind'))
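
Building on the line above, the crosswalk can then be filtered to the lakes of interest, e.g. the MN-only subset mentioned earlier. A minimal sketch, assuming a state column (whether it stores abbreviations or full names should be checked):

```r
library(dplyr)

# e.g., restrict to Minnesota lakes for the GCM work
mn_lakes <- filter(xwalk, state == 'MN')
```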