I think it makes sense to add a step to `resolve_duplicates()` that filters out duplicated rows where `ResultMeasureValue` and `ResultMeasure.MeasureUnitCode` are `NA` before selecting which duplicate record to retain.
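Something like the following could express that step (a minimal sketch; `wqp_data` and `duplicate_definition` are as defined in the example below, and where exactly this would sit inside `resolve_duplicates()` is an assumption on my part):

```r
library(dplyr)

# Sketch of the proposed addition: within each duplicated set, drop rows
# whose result value or units are missing before choosing which record to
# retain. Rows that are not duplicated are left alone.
wqp_data %>%
  group_by(across(all_of(duplicate_definition))) %>%
  filter(!(n() > 1 &
             (is.na(ResultMeasureValue) | is.na(ResultMeasure.MeasureUnitCode)))) %>%
  ungroup()
```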
However, that doesn't completely resolve this issue: we'd like the "randomly selected" duplicate to be consistent across users/machines. Take this example:
```r
> char_names_crosswalk <- tar_read(p1_char_names_crosswalk)
>
> # define arguments from clean_wqp_data
> duplicate_definition = c('OrganizationIdentifier',
+                          'MonitoringLocationIdentifier',
+                          'ActivityStartDate',
+                          'ActivityStartTime.Time',
+                          'CharacteristicName',
+                          'ResultSampleFractionText')
> remove_duplicated_rows = FALSE
>
> # define a function we want to use
> flag_duplicates <- function(wqp_data, duplicate_definition){
+
+   # Flag duplicate records using the `duplicate_definition`
+   wqp_data_out <- wqp_data %>%
+     group_by(across(all_of(duplicate_definition))) %>%
+     mutate(n_duplicated = n(),
+            flag_duplicated_row = n_duplicated > 1) %>%
+     ungroup() %>%
+     select(-n_duplicated)
+
+   return(wqp_data_out)
+
+ }
>
> # create a data subset to use for inspecting cleaning steps
> wqp_data <- tar_read(p3_wqp_data_aoi_formatted) %>%
+   filter(MonitoringLocationIdentifier == "USGS-01473030",
+          ActivityStartDate == "2003-05-12",
+          CharacteristicName == "Specific conductance",
+          !is.na(ResultMeasureValue),
+          !is.na(ResultMeasure.MeasureUnitCode),
+          ActivityStartTime.Time == "10:20:00") %>%
+   select(MonitoringLocationIdentifier, ActivityStartDate, ActivityStartTime.Time,
+          ResultMeasureValue, ResultMeasure.MeasureUnitCode,
+          DetectionQuantitationLimitMeasure.MeasureValue, ResultDetectionConditionText,
+          DetectionQuantitationLimitMeasure.MeasureUnitCode,
+          ResultCommentText, ResultTemperatureBasisText,
+          any_of(duplicate_definition))
>
> # now try stepping through clean_wqp_data and see if different records are getting
> # flagged for removal
> wqp_data %>%
+   # add in a row number to use for reconciling duplicates
+   arrange(across(all_of(duplicate_definition))) %>%
+   mutate(row_number = rownames(.)) %>%
+   left_join(y = char_names_crosswalk, by = c("CharacteristicName" = "char_name")) %>%
+   flag_duplicates(., duplicate_definition) %>%
+   # the code below is taken from remove_duplicates(). Let's check that the same
+   # records are getting flagged for removal
+   group_by(across(all_of(duplicate_definition))) %>%
+   # try using row number to arrange the rows
+   arrange(row_number) %>%
+   # in the case of duplicates, don't consider rows where result is NA
+   filter(!is.na(ResultMeasureValue), !is.na(ResultMeasure.MeasureUnitCode)) %>%
+   # To help resolve duplicates, randomly select the first record
+   # from each duplicated set and flag all others for exclusion.
+   mutate(n_duplicated = n(),
+          dup_number = seq(n_duplicated),
+          flag_duplicate_drop_random = n_duplicated > 1 & dup_number != 1) %>%
+   ungroup() %>%
+   select(MonitoringLocationIdentifier, ActivityStartDate, ActivityStartTime.Time,
+          ResultMeasureValue, ResultMeasure.MeasureUnitCode, CharacteristicName,
+          flag_duplicated_row, dup_number, row_number, flag_duplicate_drop_random)
# A tibble: 2 x 10
  MonitoringLocationIde~ ActivityStartDa~ ActivityStartTi~ ResultMeasureVa~ ResultMeasure.M~ CharacteristicN~ flag_duplicated~ dup_number row_number flag_duplicate_~
  <chr>                  <chr>            <chr>                       <dbl> <chr>            <chr>            <lgl>                 <int> <chr>      <lgl>
1 USGS-01473030          2003-05-12       10:20:00                      336 uS/cm @25C       Specific conduc~ TRUE                      1 1          FALSE
2 USGS-01473030          2003-05-12       10:20:00                      335 uS/cm @25C       Specific conduc~ TRUE                      2 2          TRUE
>
```
On my machine the record with `ResultMeasureValue == 336` (`uS/cm @25C`) would be retained, but on another machine the 335 value would be retained (i.e., `flag_duplicate_drop_random == FALSE` for that row). These two particular rows have different USGS Pcodes, so you'd need more complicated steps to resolve these duplicates, steps that I think are outside the scope of this example pipeline.
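For what it's worth, a purely hypothetical sketch of what such a step could look like (the priority list below is made up for illustration, and per the note above this is deliberately not part of the example pipeline):

```r
# Hypothetical sketch only: prefer records whose parameter code appears
# earliest in a priority list; codes not in the list sort last.
preferred_pcodes <- c("00095", "90095")  # illustrative priority order

wqp_data %>%
  group_by(across(all_of(duplicate_definition))) %>%
  # match() returns NA for codes not in the list; arrange() sorts NA last
  arrange(match(USGSPCode, preferred_pcodes), .by_group = TRUE) %>%
  mutate(flag_duplicate_drop_random = row_number() != 1) %>%
  ungroup()
```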
I don't really know how I can enforce the same row order across machines using our `duplicate_definition`. Should we just include a comment that there is some stochasticity to this step, and so different users might retain different records? Should we approach this "resolve duplicates" step a different way?
In #80, we arranged all rows within each duplicate set to enforce the same row order across machines. Noting here that we only discovered this issue because of the log files in `3_harmonize/log` - I will keep this issue/PR in mind for when I need an example of how log files can be useful!
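For reference, a minimal sketch of that kind of fix (the sort keys here are an assumption on my part; #80 may arrange on a different set of columns):

```r
library(dplyr)

# Sketch: make the "randomly selected" duplicate deterministic by sorting
# within each duplicate set on the data values themselves, so every machine
# sees the same row order before the first record is taken.
wqp_data %>%
  group_by(across(all_of(duplicate_definition))) %>%
  arrange(ResultMeasureValue, ResultMeasure.MeasureUnitCode, .by_group = TRUE) %>%
  mutate(n_duplicated = n(),
         dup_number = row_number(),
         flag_duplicate_drop_random = n_duplicated > 1 & dup_number != 1) %>%
  ungroup()
```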
> [The] order of duplicate rows may vary across machines, and so when we pick the first duplicate row as a way of "randomly" selecting a duplicate, we get different rows 😱
>
> In the case of these two duplicate sets, I am retaining a record with non-`NA` values for the result measure and result units, but on your machine, records with result units of `NA` are being retained (records that are retained are those where `flag_duplicate_drop_random` is `FALSE`). I'm guessing we'll need an arrange step to account for this, although I'm not sure how best to do that since the rows are duplicated...
>
> *Originally posted by @lekoenig in https://github.com/USGS-R/ds-pipelines-targets-example-wqp/issues/80#issuecomment-1200028742*