Closed: pbkeating closed this issue 5 years ago
@pbkeating - can you try grouping by chng_type on the $comparison_df object? Will that solve the problem?
Hi,
Thanks for that. I tried it, but as chng_type is binary and applies to additions, removals and modifications alike, it doesn't allow for easy separation. It would be a great way of doing it, though, if there were different chng_type values for additions, removals and modifications.
Thanks, Patrick
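For readers following along, here is a minimal sketch of one way the separation could be derived from a binary chng_type. It assumes chng_type takes the values "+" and "-" and that every row carries the grouping key (id here); the toy comparison_df below is made up by hand for illustration and is not real package output. This is not necessarily how compareDF does it internally.

```r
# Hypothetical stand-in for $comparison_df: one row per changed record,
# with chng_type "+" (row from the new table) or "-" (row from the old table).
comparison_df <- data.frame(
  id        = c("t1", "t1", "t2", "t6"),
  chng_type = c("+", "-", "-", "+"),
  stringsAsFactors = FALSE
)

# Classify each id by which signs appear in its group:
#   both "+" and "-" -> modified, only "+" -> added, only "-" -> removed.
ids <- unique(comparison_df$id)
group_type <- vapply(ids, function(g) {
  signs <- comparison_df$chng_type[comparison_df$id == g]
  if (all(c("+", "-") %in% signs)) {
    "modified"
  } else if ("+" %in% signs) {
    "added"
  } else {
    "removed"
  }
}, character(1))

# Attach the label to every row, then split into separate outputs.
comparison_df$grouping <- unname(group_type[comparison_df$id])
by_group <- split(comparison_df, comparison_df$grouping)
```

With this toy input, by_group$added, by_group$removed and by_group$modified can then be inspected or exported separately.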
Can you try installing the package from the add_grouping_type branch? Then add the new argument add_grouping_type=TRUE to the compare_df call. Does this help you?
@pbkeating
Apologies for the delay in getting back; I got involved in other activities.
I tried to install the package with install_github("alexsanjoseph/compareDF", ref = "add_grouping_type"), but when I tried to run it, I got the following error:
Error in compare_df(x, x_ref, c("id"), limit_html = 700, add_grouping_type = TRUE) : unused argument (add_grouping_type = TRUE)
I examined the code and couldn't find the argument add_grouping_type. See txt file attached. comparedf_function.txt
Are you sure the installation went fine? (Try restarting R, maybe?) The function (and the new argument) is here in this branch - https://github.com/alexsanjoseph/compareDF/blob/add_grouping_type/R/fnsComparison.R
Thanks. I restarted R and could run the function. Thanks for the speedy reactions on this!
I ran it and see the output below. Unfortunately, the numbers of additions and removals identified by the change_summary variable are different from those given by add_grouping_type.
Worth adding in a removed grouping?
Just realised that removed is there. Thanks!
I added the group calculation using a heuristic that I thought would work. Can you add a small reproducible example which shows the mismatch between the change summary and the grouping?
I tried out a simple example and it seems to have worked as expected. Not sure why that was not the case with the dataset I used:
a <- data.frame(id = c("t1","t2","t3","t4","t5"), gender = c("m","f","m","f", "f"), age = c(10, 13, 14, 90, 20))
b <- data.frame(id = c("t1","t4","t5", "t6", "t7", "t8", "t9"), gender = c("f","f","m","f", "f", "m", "m"), age = c(10, 93, 20, 15, 20, 35, 56))
test <- compareDF::compare_df(b, a, c("id"), add_grouping_type = TRUE)
test$comparison_df
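As an independent cross-check on this example, the expected additions, removals and modifications can be worked out in base R without compareDF. This is only a sketch: it treats b as the new table and a as the old one (matching the argument order in the call above), and the variable names are chosen here for illustration. The same kind of check on a larger dataset could show which ids the grouping labels differently.

```r
a <- data.frame(id = c("t1", "t2", "t3", "t4", "t5"),
                gender = c("m", "f", "m", "f", "f"),
                age = c(10, 13, 14, 90, 20),
                stringsAsFactors = FALSE)
b <- data.frame(id = c("t1", "t4", "t5", "t6", "t7", "t8", "t9"),
                gender = c("f", "f", "m", "f", "f", "m", "m"),
                age = c(10, 93, 20, 15, 20, 35, 56),
                stringsAsFactors = FALSE)

value_cols <- c("gender", "age")

# ids only in the new table are additions; ids only in the old one are removals.
added_ids   <- setdiff(b$id, a$id)   # t6 t7 t8 t9
removed_ids <- setdiff(a$id, b$id)   # t2 t3
common_ids  <- intersect(a$id, b$id)

# For ids present in both tables, compare the non-key columns row by row.
row_key <- function(df) do.call(paste, c(df[value_cols], sep = "|"))
a_common <- a[match(common_ids, a$id), ]
b_common <- b[match(common_ids, b$id), ]
modified_ids <- common_ids[row_key(a_common) != row_key(b_common)]  # t1 t4 t5

expected_counts <- c(added = length(added_ids),
                     removed = length(removed_ids),
                     modified = length(modified_ids))
```

These counts (4 added, 2 removed, 3 modified) are what both the change summary and the grouping should agree on for this example.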
I selected 12 variables when comparing... could that affect it?
I have no idea - my internal test case also seems to have worked, hence I did the commit. This must be some sort of edge case. Hopefully you can find out what's going on with a smaller dataset; I don't know what to do otherwise.
Hi again. I tested with just 4 variables and it got closer to the real situation, but not completely there. It sometimes misclassifies newly added rows as modified rows, and some removed rows as modified rows.
When I add further variables for comparison, the differences become more evident. Below with 5 variables added
Hmmm, there is definitely a bug in there somewhere. If you are able to find me a few cases, that would be amazing. In the meantime I'll see if I can reproduce it myself in a bit. Keeping the ticket open till then.
Wondering if it has anything to do with the inclusion of dates as variables. I was using three date variables and wondered if that could have had an impact.
Thanks again!
@pbkeating - Can you share a snippet of the data that you're working with (maybe anonymized?)
Hi,
I'll do my best to get you a similar dataset over the coming days. Thank you for the follow-up on this!
@pbkeating - were you able to get a reprex for this? If not I'll have to close this :/
Hi,
No, I didn't manage it. Thanks for all your help and responsiveness.
Best, Patrick
Thanks @pbkeating, feel free to create a new issue if this is a problem. I'm not merging the branch for now, since it seems to have issues
Hello Alex,
I've been using the compareDF package regularly since I found it last November, and it has been very helpful.
The html output is really helpful, but I wondered if it would be possible to add functionality to produce separate outputs for additions to databases and for modifications? When many additions and modifications have been made between two versions of a database and both types of output are mixed together, the output becomes more difficult to use. I'm currently adapting the comparison_df output myself, but it would be great if this were built-in functionality.
Thanks again for this package! Patrick