alexsanjoseph / compareDF

R Tool to compare two data.frames

Facilitate separation of additions to databases and modifications #20

Closed pbkeating closed 5 years ago

pbkeating commented 5 years ago

Hello Alex,

I've been regularly using and finding the compareDF package to be very helpful since I found it last November.

The HTML output is really helpful, but I wondered whether it would be possible to add functionality that produces separate outputs for additions to databases and for modifications. When many additions and modifications have been made between two versions of a database and both types of change are mixed in one output, the output becomes harder to use. I'm currently adapting the comparison_df output myself, but it would be great if this were built-in functionality.

Thanks again for this package! Patrick

alexsanjoseph commented 5 years ago

@pbkeating - can you try grouping by chng_type on the $comparison_df object? Will that solve the problem?

pbkeating commented 5 years ago

Hi,

Thanks for that. I tried it, but since chng_type is binary and applies to additions, removals and modifications alike, it doesn't allow for easy separation. It would be a great way of doing it, though, if there were a different chng_type for additions, removals and modifications.

Thanks, Patrick
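
The limitation described above can be worked around outside the package. The following is a minimal base R sketch, not the package's own method: it assumes comparison_df has a key column (id here), a chng_type column holding "+" or "-", and that a modified row appears once with each sign. The data frame is a hand-built stand-in and the grouping column is a hypothetical name.

```r
# Hand-built stand-in for the $comparison_df element (assumed layout:
# key column, chng_type of "+" or "-", then the compared columns).
comparison_df <- data.frame(
  id        = c("t1", "t1", "t2", "t4", "t4", "t6"),
  chng_type = c("+",  "-",  "-",  "+",  "-",  "+"),
  age       = c(11,   10,   13,   93,   90,   15),
  stringsAsFactors = FALSE
)

# Keys that have at least one "+" row, and keys that have at least one "-" row.
plus_ids  <- unique(comparison_df$id[comparison_df$chng_type == "+"])
minus_ids <- unique(comparison_df$id[comparison_df$chng_type == "-"])

in_plus  <- comparison_df$id %in% plus_ids
in_minus <- comparison_df$id %in% minus_ids

# Hypothetical three-way label: both signs = modified, "+" only = added,
# "-" only = removed.
comparison_df$grouping <- ifelse(in_plus & in_minus, "modified",
                                 ifelse(in_plus, "added", "removed"))

# One data frame per group, e.g. by_group$added.
by_group <- split(comparison_df, comparison_df$grouping)
```

If the key is made up of several columns, the same idea applies after pasting them into a single key string.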

alexsanjoseph commented 5 years ago

Can you try installing the package from the add_grouping_type branch? Then add the new argument, add_grouping_type = TRUE, to the compare_df call. Does this help you?

@pbkeating

pbkeating commented 5 years ago

Apologies for the delay in getting back; I got involved in other activities. I installed the package with install_github("alexsanjoseph/compareDF", ref = "add_grouping_type"), but when I tried to run it, I got the following error:

Error in compare_df(x, x_ref, c("id"), limit_html = 700, add_grouping_type = TRUE) : unused argument (add_grouping_type = TRUE)

I examined the code and couldn't find the argument add_grouping_type. See the attached text file: comparedf_function.txt

alexsanjoseph commented 5 years ago

Are you sure the installation went fine? (Try restarting R, maybe?) The function (and the new argument) is here in this branch - https://github.com/alexsanjoseph/compareDF/blob/add_grouping_type/R/fnsComparison.R
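
An "unused argument" error generally means the function loaded in the current session does not declare that argument, which can happen when an older build is still loaded. A small sketch of how one might check this is below; has_argument and demo_compare are hypothetical names introduced for illustration and are not part of compareDF.

```r
# Hypothetical helper (not part of compareDF): TRUE if `fn` declares `arg`.
has_argument <- function(fn, arg) {
  arg %in% names(formals(fn))
}

# Stand-in with the same argument name as the branch version, used here
# only so the sketch runs without the package installed.
demo_compare <- function(df_new, df_old, group_col, add_grouping_type = FALSE) NULL

has_argument(demo_compare, "add_grouping_type")

# With the branch installed and R restarted, the same check would be:
# has_argument(compareDF::compare_df, "add_grouping_type")
```

If the check on the installed function returns FALSE after a restart, the installed build most likely did not come from the branch.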

pbkeating commented 5 years ago

Thanks. I restarted R and could run the function. Thanks for the speedy reactions on this! I ran it and the output is shown below. Unfortunately, the numbers of additions and removals reported by change_summary differ from those given by add_grouping_type.

Worth adding in a removed grouping?

[Screenshot: 20190502_152533]

pbkeating commented 5 years ago

Just realised that removed is there. Thanks!

alexsanjoseph commented 5 years ago

I added the group calculation using a heuristic that I thought would work. Can you add a small reproducible example which shows the mismatch between the change summary and the grouping?

pbkeating commented 5 years ago

I tried out a simple example and it seems to have worked as expected... not sure why that was not the case with the dataset that I used:

a <- data.frame(id = c("t1","t2","t3","t4","t5"), gender = c("m","f","m","f", "f"), age = c(10, 13, 14, 90, 20))

b <- data.frame(id = c("t1","t4","t5", "t6", "t7", "t8", "t9"), gender = c("f","f","m","f", "f", "m", "m"), age = c(10, 93, 20, 15, 20, 35, 56))

test <- compareDF::compare_df(b, a, c("id"), add_grouping_type = TRUE)

test$comparison_df
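
For a cross-check that does not depend on the package, the expected groups for this example can be derived from the id column with base R set operations. This is only a sketch for comparing against the package output, assuming id uniquely identifies a row in both data frames; the variable names are illustrative.

```r
# Same data as the example above: a is the old version, b is the new one.
a <- data.frame(id = c("t1", "t2", "t3", "t4", "t5"),
                gender = c("m", "f", "m", "f", "f"),
                age = c(10, 13, 14, 90, 20),
                stringsAsFactors = FALSE)
b <- data.frame(id = c("t1", "t4", "t5", "t6", "t7", "t8", "t9"),
                gender = c("f", "f", "m", "f", "f", "m", "m"),
                age = c(10, 93, 20, 15, 20, 35, 56),
                stringsAsFactors = FALSE)

added_ids   <- setdiff(b$id, a$id)    # ids only in the new data
removed_ids <- setdiff(a$id, b$id)    # ids only in the old data
common_ids  <- intersect(a$id, b$id)  # ids present in both

# Align the common rows by id, then flag rows where any value differs.
a_common <- a[match(common_ids, a$id), ]
b_common <- b[match(common_ids, b$id), ]
changed  <- a_common$gender != b_common$gender | a_common$age != b_common$age
modified_ids <- common_ids[changed]

# Expected group sizes for this example: 4 added, 2 removed, 3 modified.
c(added = length(added_ids),
  removed = length(removed_ids),
  modified = length(modified_ids))
```

Running the same comparison on a larger dataset gives reference counts to set against both change_summary and the grouping column.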

pbkeating commented 5 years ago

I selected 12 variables when comparing... could that affect it?

alexsanjoseph commented 5 years ago

I have no idea - my internal test case also seems to have worked, hence I did the commit. This must be some sort of edge case. Hopefully you can find out what's going on with a smaller dataset; I don't know what to do otherwise.

pbkeating commented 5 years ago

Hi again. I tested with just 4 variables and it got closer to the real situation, but not completely there. It sometimes misclassifies newly added rows as modified rows, and some removed rows as modified rows.

[Screenshot: 20190502_231309]

When I add further variables for comparison, the differences become more evident. Below, with 5 variables added:

[Screenshot: 20190502_232525]

alexsanjoseph commented 5 years ago

Hmmm, there is definitely a bug in there somewhere. If you are able to find me a few cases, that would be amazing. In the meantime I'll see if I can reproduce it myself in a bit. Keeping the ticket open till then.

pbkeating commented 5 years ago

Wondering if it has anything to do with the inclusion of dates as variables. I was using three date variables and wondered if that could have had an impact.

Thanks again!

alexsanjoseph commented 5 years ago

@pbkeating - Can you share a snippet of the data that you're working with (maybe anonymized?)
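
One possible way to anonymize such a snippet while keeping it useful for reproducing the problem is to recode the key column consistently across both versions, so that added, removed and modified rows keep their relationships. The sketch below uses made-up data and hypothetical names (old, new, key); it only recodes the id column and leaves other columns untouched.

```r
# Made-up stand-ins for the two versions of a dataset.
old <- data.frame(id = c("alice", "bob"), age = c(30, 41),
                  stringsAsFactors = FALSE)
new <- data.frame(id = c("bob", "carol"), age = c(42, 25),
                  stringsAsFactors = FALSE)

# Build one lookup table over every id seen in either version, so the
# same original id always maps to the same anonymous code.
all_ids <- union(old$id, new$id)
key <- setNames(sprintf("id%03d", seq_along(all_ids)), all_ids)

old_anon <- old
new_anon <- new
old_anon$id <- unname(key[old$id])
new_anon$id <- unname(key[new$id])
```

Any other identifying columns would need similar treatment before sharing.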

pbkeating commented 5 years ago

Hi,

I'll do my best to get you a similar dataset over the coming days. Thank you for the follow-up on this!

alexsanjoseph commented 5 years ago

@pbkeating - were you able to get a reprex for this? If not I'll have to close this :/

pbkeating commented 5 years ago

Hi,

No, I didn't manage it. Thanks for all your help and responsiveness.

Best, Patrick

alexsanjoseph commented 5 years ago

Thanks @pbkeating, feel free to create a new issue if this is still a problem. I'm not merging the branch for now, since it seems to have issues.