DOI-USGS / lake-temperature-model-prep


Mega-Issue: Refactor Pipeline #271

Closed padilla410 closed 2 years ago

padilla410 commented 2 years ago

Over the past few months, a number of small issues have come up while working in the pipeline that have made debugging difficult and will ultimately result in code that is difficult to maintain.

This issue is a work in progress. It represents my current understanding of what we would like to refactor based on a few meetings from 2022-01-04. Tagged assignees should edit this issue based on their current understanding of problems with the pipeline (or feel free to make corrections to my current understanding!). Problem definitions should include why a portion of the pipeline needs refactoring, notes on how to refactor (if you've given it some thought), and desired outcomes from the refactor (simplified code? an output table? etc.). The Lakes Team will then have a meeting to see if we are all on the same page and to discuss a timeline for completion. After that meeting, I will update/finalize this issue and start the refactoring process.

Issues

Refactor crosswalks from 1_crosswalk_fetch/in that will be used in 7a_temp_coop_munge

Problem: Using the current pipeline configuration, each unique dataset munged in 7a_temp_coop_munge needs a crosswalk in 1_crosswalk_fetch/in. This results in an ever-expanding list of arguments being passed into crosswalk_coop_dat, which has two problems:

Key Outcome:

Understanding the results of crosswalk_coop_dat in 7a_temp_coop_munge is difficult because there are no intermediate summary targets

Problem: Currently, understanding exactly what happens in crosswalk_coop_dat is difficult. Sometimes a dataset contains new lakes, while other times a dataset may add data for existing lakes. There are a few potential benefits to adding these intermediary summaries:

Key Outcome:

Other Key Questions

Key Areas of the Pipeline to Refactor

lindsayplatt commented 2 years ago

~One thing that would be nice to have is a lake-to-state crosswalk. I think we could use that for the summary tables/figures mentioned above, but also it would be immediately useful to me in the GCM work. I need to get only MN lakes :)~

This was completed in #278

jordansread commented 2 years ago

Great capture of a challenging part of this pipeline. I think lakes end up needing to be matched/crosswalked from four different types of sources:

  1. Polygons. These are the most "expensive" matches we do, but are pretty straightforward. A separate shapefile with the provider IDs is matched to the NHDHR shapes.
  2. Points. Lakes or sampling locations on lakes are usually provided as lat/lon (many coop data sources and all WQP data work this way; see the point-in-polygon sketch after this list)
  3. Single site. Big dumps of data from a single focus lake come in. There isn't a specific linkage to the lake because it is implied in the file names or explicitly provided by the data contributor.
  4. Other. Lake data has some kind of alternate way of being defined as unique by the data provider. I think the most common pattern I've seen here is "name", "county", and "state", but otherwise no spatial information. These are hard to match confidently, and I don't think we've tried yet.
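
For the "Points" category, the match could be done as a lightweight point-in-polygon operation rather than a hand-built crosswalk per dataset. A minimal sketch with sf, assuming a canonical NHDHR polygon layer with a site_id column; the object names, file path, and coop columns below are hypothetical:

```r
library(sf)
library(dplyr)

# Hypothetical inputs: the canonical NHDHR lake polygons (with a site_id column) and a
# small coop site table with lat/lon columns; names and path are illustrative only
nhd_lakes <- readRDS('1_crosswalk_fetch/out/canonical_lakes_sf.rds')
coop_sites <- tibble(
  provider_id = c('LK001', 'LK002'),
  latitude = c(45.12, 46.03),
  longitude = c(-93.45, -94.21)
)

# Convert the point sites to sf and match the polygon layer's CRS
coop_sites_sf <- st_as_sf(coop_sites,
                          coords = c('longitude', 'latitude'),
                          crs = 4326) %>%
  st_transform(st_crs(nhd_lakes))

# Point-in-polygon join: each site picks up the site_id of the NHDHR lake it falls inside
points_xwalk <- st_join(coop_sites_sf, nhd_lakes, join = st_within) %>%
  st_drop_geometry() %>%
  select(provider_id, site_id)
```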

For your problem 1, I think we have a mix of things that currently work pretty well (I think the polygon process is decent, and I think the WQP crosswalks make sense), are overdone/laborious (e.g., creating a new "Points" xwalk for a dataset of ~15 lakes seems excessive, and like Julie is mentioning, these tiny crosswalks add up), or are ignored/missing (this is probably the case for many of the "Points" datasets and probably all of the "Other" datasets). Lastly, a single site with a known NHDHR ID shouldn't need a separate file with this mapping - hopefully we'd be able to support coding that directly into the parser file as a variable.
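
For the single-site case, one way to "code that directly in" is for the parser itself to attach the known NHDHR ID, so no one-row crosswalk file is needed. A hedged sketch; the parser name and ID value are made up:

```r
library(dplyr)

# Hypothetical parser for a single-lake data dump where the NHDHR ID is already known;
# attaching the ID inside the parser removes the need for a one-row crosswalk file
parse_example_lake_dump <- function(infile) {
  readr::read_csv(infile) %>%
    mutate(site_id = 'nhdhr_xxxxxxx')  # contributor-supplied NHDHR ID (placeholder value)
}
```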

For your solution 1, there seems to be a difference between ad hoc IDs or information we'd use to link the lake vs. actual IDs that the data contributor is using. If actual IDs are being used (LAGOS IDs, WBICs, and DOWs are common examples), I agree we'd be doing a better service by exporting these linkages in a crosswalk that is part of the dataset. As an aside, I'd suggest that the crosswalk include the GNIS name, the state, and the county of the lake (see #278 for an element of this). Sometimes a provider has a naming scheme that is unique but probably internal, and they don't have that many lakes, so exporting such an xwalk in a data release might be confusing and mostly empty. Lastly, some ways we're linking lakes may not need a persistent target/file with this matching (such as point in polygon, with the exception of WQP, which I think needs this matching because there can be so many unique sites within a lake). I'd like to think we can do this lower-cost matching on the fly when we're merging things in your problem 2 and avoid a persistent xwalk for those altogether?

revised suggestion for solution 2

A larger crosswalk (including state, county, and GNIS name as well) with the major contributors or semi-official IDs is exported in data releases. We have some decisions to make regarding what belongs in this export.
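
As a starting point for that discussion, one possible shape for the exported crosswalk (a sketch only; column names and placeholder values are illustrative, not decided):

```r
library(dplyr)

# Illustrative structure for the exported crosswalk; NAs where a provider ID does not apply
lake_id_xwalk <- tribble(
  ~site_id,       ~GNIS_name,     ~state, ~county,       ~LAGOS_id,    ~WBIC,     ~DOW,
  'nhdhr_<id1>',  'Example Lake', 'MN',   'Example Co.', NA,           NA,        '<DOW id>',
  'nhdhr_<id2>',  'Another Lake', 'WI',   'Example Co.', '<LAGOS id>', '<WBIC>',  NA
)
```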

For your problem 2: Yes, I think the current pattern of crosswalk_coop_dat is not built to scale elegantly and handle an increasing number of unique crosswalks. I see three problems: 1) the function doesn't scale well with new sources (lotsa copy+paste and adding new arguments), 2) as you mentioned, it is hard to know what you changed or to track down how and whether your new data were added, and 3) the function as designed doesn't support adding sources without an explicit NHDHR crosswalk, which, as discussed above, is overkill for most small data sources; including that many unique crosswalks is a lot of work and would create a lot of small new targets.

I'd suggest breaking this problem out into two parts, one for the desired intermediate summary targets (what do those look like? at what stage(s) of the pipeline do they exist? how are they versioned and tracked?) and a second for the proposed overhaul of what crosswalk_coop_dat does.

One way to handle the function is to modify the upstream data format so that each coop parser either handles its own NHDHR matching or adds a column to its output specifying which function or crosswalk file the downstream matching target should use (e.g., specifying a certain function name means that function will expect certain columns to exist, such as lat/lon, while specifying a crosswalk file means a simple join is used). Then crosswalk_coop_dat would basically be a bind, an !is.na() filter, and a de-duplication and summary function that wouldn't require adding new code sections when new file types are included.
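
A hedged sketch of what that could look like, assuming each parsed coop file declares its matching method and the needed columns; none of these helper functions or column names exist in the pipeline today:

```r
library(dplyr)

# Hypothetical dispatch: each parsed coop file declares how it should be matched
# ('has_id' = site_id already attached by the parser, 'point_in_polygon' = match
# lat/lon to NHDHR polygons, 'xwalk_join' = join against a provider crosswalk file)
match_coop_file <- function(parsed, method, xwalk = NULL, nhd_lakes = NULL) {
  switch(method,
    has_id = parsed,                                               # site_id coded in by the parser
    point_in_polygon = match_points_to_lakes(parsed, nhd_lakes),   # hypothetical sf-based helper
    xwalk_join = left_join(parsed, xwalk, by = 'provider_id')      # simple join to a provider xwalk
  )
}

# crosswalk_coop_dat then reduces to a bind + filter + de-duplication/summary,
# so adding a new file type doesn't require a new code section here
crosswalk_coop_dat <- function(matched_list) {
  bind_rows(matched_list) %>%
    filter(!is.na(site_id)) %>%
    distinct(site_id, DateTime, depth, temp, .keep_all = TRUE)
}
```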

lindsayplatt commented 2 years ago

Lake-to-state xwalk is complete, see #278.

xwalk <- readRDS(sc_retrieve('2_crosswalk_munge/out/lake_to_state_xwalk.rds.ind'))
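
Building on the line above, the crosswalk can then be filtered to the lakes of interest, e.g. the MN-only subset mentioned earlier. A minimal sketch, assuming a state column (whether it stores abbreviations or full names should be checked):

```r
library(dplyr)

# e.g., restrict to Minnesota lakes for the GCM work
mn_lakes <- filter(xwalk, state == 'MN')
```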