This PR makes the following changes:

- Adds some text to the README, including authors and a note about harmonization and the scope of the pipeline
- Simplifies `clean_wqp_data` so that we only omit records that are exactly duplicated across all columns
Regarding the last point above, we previously defined duplicate records based on a combination of columns, because that is what I have seen multiple projects do, including code from Meg and Jenny that I used as a guide when creating these functions. However, that definition is highly project-specific, and I don't feel comfortable using that sort of framework in this template pipeline. I've left the old functions (`flag_duplicates` and `remove_duplicates`) in place in case an advanced user wants to pick them up and modify them for their own purposes, but I've added notes to each to clarify that we're not currently using them and instead only remove rows that are exact duplicates.
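To make the difference between the two definitions concrete, here is a minimal sketch in Python. It is not the pipeline's own code: the records, column names, and both helper functions are hypothetical and exist only for illustration.

```python
# Hypothetical records; the column names are illustrative, not the pipeline's schema.
records = [
    {"site": "A", "date": "2020-01-01", "param": "temp", "value": 4.1, "lab": "X"},
    {"site": "A", "date": "2020-01-01", "param": "temp", "value": 4.1, "lab": "X"},  # exact copy of the first row
    {"site": "A", "date": "2020-01-01", "param": "temp", "value": 4.3, "lab": "Y"},  # same site/date/param, different value
    {"site": "B", "date": "2020-01-02", "param": "temp", "value": 5.0, "lab": "X"},
]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row that matches on every column."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(sorted(row.items()))  # the whole row is the key
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def drop_duplicates_by_columns(rows, columns):
    """Keep the first row for each combination of the chosen columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[c] for c in columns)  # only a subset of columns is the key
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


exact = drop_exact_duplicates(records)
by_columns = drop_duplicates_by_columns(records, ["site", "date", "param"])
print(len(exact), len(by_columns))
```

In this toy data, the exact-match rule drops only the literal copy, while the column-subset rule also drops the row that shares a site, date, and parameter but reports a different value. Which columns belong in that subset is the project-specific decision that this PR avoids making on the user's behalf.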
The diff of the records summary shows that by reconsidering what we call a duplicate, we pick up a decent number of samples, even in the tiny toy watershed that drives this example.
Note that I plan to tag v0.2.0 of the code after this PR gets merged, so I've gone ahead and updated the project changelog.
Closes #111