Occasional test failure - TADA_FindPotentialDuplicatesMultipleOrgs does not grow dataset

USEPA / EPATADA

This R package can be used to compile and evaluate Water Quality Portal (WQP) data for samples collected from surface water monitoring sites on streams and lakes. It can be used to create applications that support water quality programs and help states, tribes, and other stakeholders efficiently analyze the data.

https://usepa.github.io/EPATADA/

Creative Commons Zero v1.0 Universal


Occasional test failure - TADA_FindPotentialDuplicatesMultipleOrgs does not grow dataset #513

Closed hillarymarler closed 1 month ago

hillarymarler commented 1 month ago

We have noticed occasional failures of this test, although TADA_FindPotentialDuplicatesMultipleOrgs has not been edited recently.

The solution for this issue will require finding example data sets which cause this failure and modifying TADA_FindPotentialDuplicatesMultipleOrgs to address those scenarios.

hillarymarler commented 1 month ago

Example data set that will fail this test:

df <- TADA_DataRetrieval(
  startDate = "2006-07-17",
  endDate = "2006-07-18",
  statecode = "DE"
)

hillarymarler commented 1 month ago

Additional data sets that cause the test to fail, for use in testing:

df2 <- TADA_DataRetrieval(
  startDate = "2023-02-14",
  endDate = "2023-02-15",
  statecode = "CO"
)

df3 <- TADA_DataRetrieval(
  startDate = "2010-11-30",
  endDate = "2010-12-01",
  statecode = "AL"
)

hillarymarler commented 1 month ago

@wokenny13 - I think the extra rows are being added when records from the same organization are identified as duplicates in TADA_FindPotentialDuplicatesMultipleOrgs, and that this is a result of updates made to TADA_FindNearbySites.

wokenny13 commented 1 month ago

I am also looking into this.

I ran TADA_FindPotentialDuplicatesMultipleOrgs and TADA_FindPotentialDuplicatesSingleOrg with the first example data set (df).

The number of rows increased only for TADA_FindPotentialDuplicatesMultipleOrgs, which identified 25 results as potential duplicates. That count coincides with the number of rows added.

TADA_FindPotentialDuplicatesSingleOrg identifies 44 results as potential duplicates, but it did not add any rows for the first example data set.
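The comparison above comes down to counting rows before and after a flagging step. A minimal sketch in base R, using a toy data frame and a hypothetical helper rows_added() (neither is part of EPATADA); in practice "before" would be the retrieved data set and "after" the output of the function under test:

```r
# Hypothetical helper: how many rows did a processing step add?
rows_added <- function(before, after) {
  nrow(after) - nrow(before)
}

# Toy stand-in for a retrieved data set (not real WQP data).
before <- data.frame(
  result_id = c("r1", "r2", "r3"),
  stringsAsFactors = FALSE
)

# A well-behaved flagging step adds columns only.
flagged <- before
flagged$flag <- c("Unique", "Duplicate", "Duplicate")

# A misbehaving step repeats a row.
grown <- flagged[c(1, 2, 2, 3), ]

rows_added(before, flagged)  # 0
rows_added(before, grown)    # 1
```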

hillarymarler commented 1 month ago

I think the issue may be here:

# get rid of results with no site group added - not duplicated spatially
  dupsites <- subset(dupsites, !dupsites$TADA.MonitoringLocationIdentifier %in% c("No nearby sites")) %>%
    tidyr::separate_rows(TADA.MonitoringLocationIdentifier, sep = ",")

This may be a result of the changes to TADA_FindNearbySites.
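To see why a step like this can change the row count, here is a small illustration with made-up site identifiers. It uses base R as a stand-in for tidyr::separate_rows (the helper split_rows() is hypothetical) and shows that splitting a comma-delimited column yields one row per identifier, so a later join on that column can match a single result more than once. Whether this is what actually happens inside TADA_FindPotentialDuplicatesMultipleOrgs is an assumption, not something confirmed here.

```r
# Toy site groups; the identifiers are made up.
dupsites <- data.frame(
  site_group = c("G1", "G2"),
  TADA.MonitoringLocationIdentifier = c("ORG1-A,ORG2-B", "ORG1-A,ORG3-C"),
  stringsAsFactors = FALSE
)

# Base R stand-in for tidyr::separate_rows(): one row per separated value.
split_rows <- function(df, col, sep = ",") {
  parts <- strsplit(df[[col]], sep, fixed = TRUE)
  out <- df[rep(seq_len(nrow(df)), lengths(parts)), , drop = FALSE]
  out[[col]] <- unlist(parts)
  rownames(out) <- NULL
  out
}

long_sites <- split_rows(dupsites, "TADA.MonitoringLocationIdentifier")
nrow(long_sites)  # 4 rows from the original 2

# One result at site ORG1-A, which belongs to both site groups.
results <- data.frame(
  result_id = "r1",
  TADA.MonitoringLocationIdentifier = "ORG1-A",
  stringsAsFactors = FALSE
)

joined <- merge(results, long_sites, by = "TADA.MonitoringLocationIdentifier")
nrow(joined)  # 2: the single result now appears once per site group
```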

wokenny13 commented 1 month ago

Logical NA values were found in .data for TADA.MonitoringLocationIdentifier, whereas the values in dupsdat for TADA.MonitoringLocationIdentifier were the character string "NA".

typeof(dupsdat$TADA.MonitoringLocationIdentifier)
[1] "character"
df_nearby_sites_test <- TADA_FindNearbySites(df_ex)
[1] "No nearby sites detected using input buffer distance."
typeof(df_nearby_sites_test$TADA.MonitoringLocationIdentifier)
[1] "logical"
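The type difference matters because a logical NA and the string "NA" are different values in R. A short base R check, using toy values rather than the package's data:

```r
logical_na <- NA   # what an all-missing column holds by default
string_na <- "NA"  # an ordinary two-character string

typeof(logical_na)  # "logical"
typeof(string_na)   # "character"

is.na(logical_na)   # TRUE
is.na(string_na)    # FALSE

# The two never match each other in membership tests.
logical_na %in% string_na  # FALSE
string_na %in% logical_na  # FALSE
```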

Inserting this at line 1278, under the comment "# connect back to original dataset", may be a solution:

dplyr::mutate(
  TADA.MonitoringLocationIdentifier = ifelse(
    TADA.MonitoringLocationIdentifier %in% NA,
    "NA",
    TADA.MonitoringLocationIdentifier
  )
) %>%

That is, unless there is a preferred variable type that the column should be converted to within the TADA_FindNearbySites() function.
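As a rough sketch of the two options raised here, using toy vectors in base R (the helper names are hypothetical, and neither is presented as the change adopted in EPATADA): the first mirrors the proposed ifelse() replacement, while the second only changes the column type and keeps the values missing.

```r
# Option 1: replace missing identifiers with the string "NA".
na_to_string <- function(ids) {
  ifelse(is.na(ids), "NA", as.character(ids))
}

# Option 2: keep the values missing, but make the column character.
na_to_character <- function(ids) {
  as.character(ids)
}

all_missing <- c(NA, NA)  # logical, as in the typeof() output above
mixed <- c("ORG1-A", NA)  # character with one missing value

na_to_string(all_missing)             # "NA" "NA"
na_to_string(mixed)                   # "ORG1-A" "NA"
typeof(na_to_character(all_missing))  # "character"
is.na(na_to_character(all_missing))   # TRUE TRUE
```

Option 1 produces the character "NA" observed in dupsdat. Option 2 makes the types agree but leaves the values as true missing values, so any later comparison against the string "NA" would still need to account for that.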