This PR adds code to the clean_wqp_data() function to flag true missing results as well as duplicated records. Both of these steps are inspired by data cleaning steps in the proxies example files in this repo.
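For concreteness, here's a minimal sketch of how the missing-result flag could work (not the exact implementation in clean_wqp_data()). ResultMeasureValue and ResultDetectionConditionText are standard WQP columns; the assumption here is that a record counts as a "true" missing result only when it has neither a reported value nor a detection condition, since a detection condition usually indicates a censored rather than missing result.

```r
library(dplyr)

# Hedged sketch: treat a record as a "true" missing result when it has no
# reported value and no detection condition. Unflagged rows are left as NA
# rather than FALSE, matching the flagging convention in the output below.
flag_missing_results <- function(wqp_data) {
  wqp_data %>%
    mutate(flag_missing_result = if_else(
      is.na(ResultMeasureValue) & is.na(ResultDetectionConditionText),
      TRUE, NA
    ))
}
```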
I want to keep the data cleaning steps fairly "light touch" in this example repo, but I know the step to identify and resolve duplicates can get much more complex in many WQP workflows (e.g. this effort to harmonize WQP data across many characteristics in the Delaware River Basin). To make the data cleaning code more modular and hopefully enable further development by individual users, I've split off the step to flag duplicates into a separate function, flag_duplicates().

I struggled a bit with a "right-sized" example for this repo. Just flagging the duplicates doesn't actually help the user resolve those duplicate sets, since we presumably would want to keep one record from each set. So I added a second flag based on the assumption that we'd arbitrarily drop all duplicates except the first one. Our code doesn't currently do anything with this extra bit of information, and I'm on the fence as to whether to drop duplicates or just flag them in this workflow. I'd be interested to hear any other thoughts you have.
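To make the idea concrete, here's a rough sketch of the two-flag approach, not the code as written; the default grouping columns that define a duplicate set are assumptions and would need to match whatever fields the workflow treats as identifying "the same" record.

```r
library(dplyr)

# Hedged sketch of flag_duplicates(). The default grouping columns are
# assumed WQP fields; adjust them to match how the workflow defines a
# duplicate set.
flag_duplicates <- function(wqp_data,
                            grouping_cols = c("MonitoringLocationIdentifier",
                                              "ActivityStartDate",
                                              "CharacteristicName",
                                              "ResultMeasureValue")) {
  wqp_data %>%
    group_by(across(all_of(grouping_cols))) %>%
    mutate(
      # Flag every member of a duplicate set (NA rather than FALSE for
      # unflagged rows, consistent with the other flags in this workflow)
      flag_duplicated_row = if_else(n() > 1, TRUE, NA),
      # Second flag marking the records we'd lose if we arbitrarily kept
      # only the first record in each duplicate set
      flag_duplicate_drop = if_else(n() > 1 & row_number() > 1, TRUE, NA)
    ) %>%
    ungroup()
}
```

Keeping the "drop" decision as a separate flag leaves the flag-versus-drop question open: downstream code could filter on flag_duplicate_drop if we decide to drop, or ignore it if we only want to flag.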
Here's what I get when I run the code:
```r
> tar_load(p3_wqp_data_aoi_clean)
>
> p3_wqp_data_aoi_clean %>%
+   group_by(flag_missing_result) %>%
+   summarize(n = n())
# A tibble: 2 x 2
  flag_missing_result     n
  <lgl>               <int>
1 TRUE                    1
2 NA                  20265
>
> p3_wqp_data_aoi_clean %>%
+   group_by(flag_duplicated_row) %>%
+   summarize(n = n())
# A tibble: 2 x 2
  flag_duplicated_row     n
  <lgl>               <int>
1 TRUE                 7808
2 NA                  12458
```
Closes #16