DOI-USGS / ds-pipelines-targets-example-wqp

An example targets pipeline for pulling data from the Water Quality Portal (WQP)

Simplify how duplicate records are handled in `clean_wqp_data` #112

Closed lekoenig closed 1 year ago

lekoenig commented 1 year ago

This PR makes the following changes:

  1. Makes formatting changes to the README
  2. Adds some text to the README, including authors and a note about harmonization and the scope of the pipeline
  3. Simplifies `clean_wqp_data` so that we only omit records that are exact duplicates across all columns

Regarding the third point above, we previously defined duplicate records based on some combination of columns because that is what I have seen multiple projects do, including code from Meg and Jenny that I used as a guide when creating these functions. However, that definition is highly project-specific, and I don't feel comfortable using that sort of framework in this template pipeline. I've left the old functions (`flag_duplicates` and `remove_duplicates`) in place in case an advanced user wants to pick them up and modify them for their own purposes, but I've added notes to each to clarify that we're not currently using them and instead only remove rows that are exact duplicates.
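To illustrate the difference between the two definitions, here is a minimal, language-neutral sketch written in Python. It is not the pipeline's R code. The toy records, the column names, and the `drop_duplicates` helper are all hypothetical and only meant to show why a column-subset definition can discard more rows than an exact-match definition.

```python
# Hypothetical toy records; column names are illustrative, not the WQP schema.
records = [
    {"site": "A", "date": "2020-01-01", "param": "temperature", "value": 4.1, "method": "probe"},
    # Exact copy of the first row across every column.
    {"site": "A", "date": "2020-01-01", "param": "temperature", "value": 4.1, "method": "probe"},
    # Same site/date/param as the first row, but a different value and method.
    {"site": "A", "date": "2020-01-01", "param": "temperature", "value": 4.3, "method": "grab"},
    {"site": "B", "date": "2020-01-02", "param": "temperature", "value": 5.0, "method": "probe"},
]


def drop_duplicates(rows, key_columns=None):
    """Keep the first row seen for each distinct key.

    With key_columns=None, every column is compared, so only exact
    duplicates are dropped. Otherwise only the listed columns are compared.
    """
    seen = set()
    kept = []
    for row in rows:
        cols = sorted(row) if key_columns is None else key_columns
        key = tuple((col, row[col]) for col in cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Exact-duplicate definition: compare all columns.
exact = drop_duplicates(records)

# Column-subset definition (an assumed, project-specific choice of columns).
subset = drop_duplicates(records, key_columns=["site", "date", "param"])

print(len(exact), len(subset))
```

In this toy case the exact-match definition keeps the row that shares a site, date, and parameter with another record but reports a different value, while the column-subset definition drops it.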

The diff of the records summary shows that, by reconsidering what we call a duplicate, we retain a decent number of additional samples, even in the tiny toy watershed that drives this example:

[Image: diff of the records summary]

Note that I plan to tag v0.2.0 of the code after this PR gets merged, so I've gone ahead and updated the project changelog.

Closes #111