alexsanjoseph / compareDF

R Tool to compare two data.frames

Bug in compare_df function #49

Closed jenko1979 closed 1 year ago

jenko1979 commented 1 year ago

Describe the bug: compare_df produces extra rows within a group when comparing datasets, so a '+' and an '=' marker end up in the same group, which is incorrect. The bug does not appear if numeric variables are converted to character first, but that makes the function's tolerance options redundant.

To Reproduce: RDS datasets advs and qcadvs were provided by email.

The example below is actually a completely new row (I had deleted this row from the QC version), so there should never be an '=' row for this record.

The screenshot below shows the group that is causing the issue:

image

You can see the issue by running the following code:

library(dplyr)
library(compareDF)

paul1 <- advs %>% mutate(ADT = as.character(ADT)) %>% filter(USUBJID == "PHUSE10001") %>% select(USUBJID, PARAMCD, ADT, AVAL, CHG, BASE)

paul2 <- qcadvs %>% mutate(ADT = as.character(ADT)) %>% filter(USUBJID == "PHUSE10001") %>% select(USUBJID, PARAMCD, ADT, AVAL, CHG, BASE)

compare_df(paul1, paul2, group_col = c("USUBJID", "PARAMCD", "ADT"), keep_unchanged_rows = TRUE, tolerance = 0)

x <- compare_df(paul1, paul2, group_col = c("USUBJID", "PARAMCD", "ADT"), keep_unchanged_rows = TRUE, tolerance = 0)

y <- x$comparison_df
View(y)

A workaround we have found is to convert all numeric variables to character. compare_df then works as expected in this case, giving only the correct new-row '+' for this group. However, this makes the tolerance options redundant, since a tolerance cannot be applied to character variables.

Here is the code where I just convert the 3 numeric variables to character and run the exact same comparison:

paul1c <- paul1 %>% mutate(AVAL = as.character(AVAL), CHG = as.character(CHG), BASE = as.character(BASE))

paul2c <- paul2 %>% mutate(AVAL = as.character(AVAL), CHG = as.character(CHG), BASE = as.character(BASE))

compare_df(paul1c, paul2c, group_col = c("USUBJID", "PARAMCD", "ADT"), keep_unchanged_rows = TRUE, tolerance = 0)

xc <- compare_df(paul1c, paul2c, group_col = c("USUBJID", "PARAMCD", "ADT"), keep_unchanged_rows = TRUE, tolerance = 0)

yc <- xc$comparison_df
View(yc)

This gives the correct result: image
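To illustrate why the character workaround makes tolerance redundant, here is a minimal base R sketch. The values and the absolute-tolerance check are made up for illustration; they are not taken from advs/qcadvs and are not meant to mirror compare_df's internal logic.

```r
# Made-up values for illustration; not taken from advs/qcadvs.
base_val <- 72.5
qc_val   <- 72.5004

# As numerics, a tolerance can absorb a small difference.
# (Hypothetical absolute tolerance, not compare_df's own implementation.)
tol <- 0.001
numeric_match <- abs(base_val - qc_val) <= tol

# As character strings only exact equality is available,
# so the same pair is reported as different.
character_match <- as.character(base_val) == as.character(qc_val)

print(numeric_match)
print(character_match)
```

With these values the numeric check treats the pair as equal while the string check does not, which is the trade-off the workaround introduces.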

alexsanjoseph commented 1 year ago

@jenko1979 - I've pushed the latest version to the master branch. You can install it with:

devtools::install_github('alexsanjoseph/compareDF')

Can you try and see if this fixes your problem?

jenko1979 commented 1 year ago

@alexsanjoseph - I tested on this test data and the results are now as expected. Further testing on much larger datasets may be worthwhile, to see whether any differences appear between converting all numeric variables to character (which has worked very well in extensive use) and leaving them as numeric. Initial signs are good that this has been resolved, though. Thanks for the quick response!
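For testing on larger datasets, one possible check is to scan a comparison result for groups that carry both a '+' and an '=' marker, which is the symptom reported in this issue. The sketch below is base R only. The helper name find_mixed_groups is hypothetical, the marker column name chng_type is an assumption about the layout of comparison_df, and the demo data (including the PARAMCD values) is made up.

```r
# Return the keys of groups whose rows contain both a '+' and a '=' marker.
# marker_col = "chng_type" is an assumption about comparison_df's layout.
find_mixed_groups <- function(comparison_df, group_col, marker_col = "chng_type") {
  # Build one key per row from the grouping columns.
  key_parts <- unname(as.list(comparison_df[group_col]))
  key <- do.call(paste, c(key_parts, sep = "|"))
  # Collect the markers seen in each group.
  markers_by_group <- split(comparison_df[[marker_col]], key)
  is_mixed <- vapply(markers_by_group,
                     function(m) all(c("+", "=") %in% m),
                     logical(1))
  names(markers_by_group)[is_mixed]
}

# Made-up stand-in for x$comparison_df; not real study data.
demo <- data.frame(
  USUBJID   = rep("PHUSE10001", 4),
  PARAMCD   = c("SYSBP", "SYSBP", "DIABP", "DIABP"),
  ADT       = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"),
  chng_type = c("+", "=", "+", "-"),
  stringsAsFactors = FALSE
)

mixed <- find_mixed_groups(demo, c("USUBJID", "PARAMCD", "ADT"))
print(mixed)
```

Running the same helper on the numeric and the character versions of a comparison (y and yc above) would show whether either still contains such mixed groups.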