I think it makes sense to add a step to `resolve_duplicates()` that filters out duplicated rows where `ResultMeasureValue` and `ResultMeasure.MeasureUnitCode` are `NA` before selecting which duplicate record to retain.
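Something like the following could express that step (a minimal sketch; `wqp_data` and `duplicate_definition` are as defined in the example below, and where exactly this would sit inside `resolve_duplicates()` is an assumption on my part):

```r
library(dplyr)

# Sketch of the proposed addition: within each duplicated set, drop rows
# whose result value or units are missing before choosing which record to
# retain. Rows that are not duplicated are left alone.
wqp_data %>%
  group_by(across(all_of(duplicate_definition))) %>%
  filter(!(n() > 1 &
             (is.na(ResultMeasureValue) | is.na(ResultMeasure.MeasureUnitCode)))) %>%
  ungroup()
```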
However, that doesn't completely resolve this issue: we'd like the "randomly selected" duplicate to be consistent across users/machines. Take this example:
```r
> char_names_crosswalk <- tar_read(p1_char_names_crosswalk)
>
> # define arguments from clean_wqp_data
> duplicate_definition = c('OrganizationIdentifier',
+                          'MonitoringLocationIdentifier',
+                          'ActivityStartDate',
+                          'ActivityStartTime.Time',
+                          'CharacteristicName',
+                          'ResultSampleFractionText')
> remove_duplicated_rows = FALSE
>
> # define a function we want to use
> flag_duplicates <- function(wqp_data, duplicate_definition){
+
+   # Flag duplicate records using the `duplicate_definition`
+   wqp_data_out <- wqp_data %>%
+     group_by(across(all_of(duplicate_definition))) %>%
+     mutate(n_duplicated = n(),
+            flag_duplicated_row = n_duplicated > 1) %>%
+     ungroup() %>%
+     select(-n_duplicated)
+
+   return(wqp_data_out)
+
+ }
>
> # create a data subset to use for inspecting cleaning steps
> wqp_data <- tar_read(p3_wqp_data_aoi_formatted) %>%
+   filter(MonitoringLocationIdentifier == "USGS-01473030",
+          ActivityStartDate == "2003-05-12",
+          CharacteristicName == "Specific conductance",
+          !is.na(ResultMeasureValue),
+          !is.na(ResultMeasure.MeasureUnitCode),
+          ActivityStartTime.Time == "10:20:00") %>%
+   select(MonitoringLocationIdentifier, ActivityStartDate, ActivityStartTime.Time,
+          ResultMeasureValue, ResultMeasure.MeasureUnitCode,
+          DetectionQuantitationLimitMeasure.MeasureValue, ResultDetectionConditionText,
+          DetectionQuantitationLimitMeasure.MeasureUnitCode,
+          ResultCommentText, ResultTemperatureBasisText,
+          any_of(duplicate_definition))
>
> # now try stepping through clean_wqp_data and see if different records are getting
> # flagged for removal
> wqp_data %>%
+   # add in a row number to use for reconciling duplicates
+   arrange(across(all_of(duplicate_definition))) %>%
+   mutate(row_number = rownames(.)) %>%
+   left_join(y = char_names_crosswalk, by = c("CharacteristicName" = "char_name")) %>%
+   flag_duplicates(., duplicate_definition) %>%
+   # the code below is taken from remove_duplicates(). Let's check that the same
+   # records are getting flagged for removal
+   group_by(across(all_of(duplicate_definition))) %>%
+   # try using row number to arrange the rows
+   arrange(row_number) %>%
+   # in the case of duplicates, don't consider rows where result is NA
+   filter(!is.na(ResultMeasureValue), !is.na(ResultMeasure.MeasureUnitCode)) %>%
+   # To help resolve duplicates, randomly select the first record
+   # from each duplicated set and flag all others for exclusion.
+   mutate(n_duplicated = n(),
+          dup_number = seq(n_duplicated),
+          flag_duplicate_drop_random = n_duplicated > 1 & dup_number != 1) %>%
+   ungroup() %>%
+   select(MonitoringLocationIdentifier, ActivityStartDate, ActivityStartTime.Time,
+          ResultMeasureValue, ResultMeasure.MeasureUnitCode, CharacteristicName,
+          flag_duplicated_row, dup_number, row_number, flag_duplicate_drop_random)
# A tibble: 2 x 10
  MonitoringLocationIde~ ActivityStartDa~ ActivityStartTi~ ResultMeasureVa~ ResultMeasure.M~ CharacteristicN~ flag_duplicated~ dup_number row_number flag_duplicate_~
  <chr>                  <chr>            <chr>                       <dbl> <chr>            <chr>            <lgl>                 <int> <chr>      <lgl>
1 USGS-01473030          2003-05-12       10:20:00                      336 uS/cm @25C       Specific conduc~ TRUE                      1 1          FALSE
2 USGS-01473030          2003-05-12       10:20:00                      335 uS/cm @25C       Specific conduc~ TRUE                      2 2          TRUE
>
```
On my machine the record with `ResultMeasureValue == 336` (`uS/cm @25C`) would be retained, but on another machine the 335 value would be retained (i.e., `flag_duplicate_drop_random == FALSE` for that row). These two particular rows have different USGS Pcodes, so you'd need more complicated steps to resolve these duplicates, steps that I think are outside the scope of this example pipeline.
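For what it's worth, a purely hypothetical sketch of what such a step could look like (the priority list below is made up for illustration, and per the note above this is deliberately not part of the example pipeline):

```r
# Hypothetical sketch only: prefer records whose parameter code appears
# earliest in a priority list; codes not in the list sort last.
preferred_pcodes <- c("00095", "90095")  # illustrative priority order

wqp_data %>%
  group_by(across(all_of(duplicate_definition))) %>%
  # match() returns NA for codes not in the list; arrange() sorts NA last
  arrange(match(USGSPCode, preferred_pcodes), .by_group = TRUE) %>%
  mutate(flag_duplicate_drop_random = row_number() != 1) %>%
  ungroup()
```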
I don't really know how I can enforce the same row order across machines using our `duplicate_definition`. Should we just include a comment that there is some stochasticity to this step, and so different users might retain different records? Should we approach this "resolve duplicates" step a different way?
In #80, we arranged all rows within each duplicate set to enforce the same row order across machines. Noting here that we only discovered this issue because of the log files in `3_harmonize/log` - I will keep this issue/PR in mind for when I need an example of how log files can be useful!
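For reference, a minimal sketch of that kind of fix (the sort keys here are an assumption on my part; #80 may arrange on a different set of columns):

```r
library(dplyr)

# Sketch: make the "randomly selected" duplicate deterministic by sorting
# within each duplicate set on the data values themselves, so every machine
# sees the same row order before the first record is taken.
wqp_data %>%
  group_by(across(all_of(duplicate_definition))) %>%
  arrange(ResultMeasureValue, ResultMeasure.MeasureUnitCode, .by_group = TRUE) %>%
  mutate(n_duplicated = n(),
         dup_number = row_number(),
         flag_duplicate_drop_random = n_duplicated > 1 & dup_number != 1) %>%
  ungroup()
```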
> [The] order of duplicate rows may vary across machines, and so when we pick the first duplicate row as a way of "randomly" selecting a duplicate, we get different rows 😱
>
> In the case of these two duplicate sets, I am retaining a record with non-`NA` values for the result measure and result units, but on your machine, records with result units of `NA` are being retained (records that are retained are those where `flag_duplicate_drop_random` is `FALSE`). I'm guessing we'll need an arrange step to account for this, although I'm not sure how best to do that since the rows are duplicated...
>
> *Originally posted by @lekoenig in https://github.com/USGS-R/ds-pipelines-targets-example-wqp/issues/80#issuecomment-1200028742*