Closed bandrewfox closed 12 months ago
Hi Brian, thanks for the issue. I can see that you are moving away from the "tidy" format into a "wide" format. Have you tried the following?
ctable = compare_df(new_df, old_df, c("var1"))
wide_output = create_wide_output(ctable)
print(wide_output)
You get
  grp var2_old var2_new var1_old var1_new val3_old val3_new val2_old val2_new val1_old val1_new
1   2        Y        Y        B        B        2      2.1       B1       B1        2        2
2   3        X        X        C        C        3      4.0       C1       C2        3        3
From this you can use select to choose the columns you are interested in.
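As a minimal sketch of that column selection, the snippet below uses a hand-built data frame shaped like part of the wide output above; in practice the object would come from create_wide_output. It sticks to base R so it runs without extra packages. With dplyr loaded, `select(wide_output, grp, starts_with("val3"))` should pick the same columns.

```r
# Hand-built stand-in for part of the wide output shown above
# (an assumption for illustration, not produced by compareDF here).
wide_output <- data.frame(
  grp      = c(2, 3),
  var1_old = c("B", "C"),
  var1_new = c("B", "C"),
  val3_old = c(2, 3),
  val3_new = c(2.1, 4.0),
  stringsAsFactors = FALSE
)

# Keep the group column plus every column that belongs to val3.
keep <- c("grp", grep("^val3_", names(wide_output), value = TRUE))
val3_only <- wide_output[, keep, drop = FALSE]
print(val3_only)
```

The same pattern extends to several variables by widening the regular expression, for example `"^(val2|val3)_"`.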
Hi Alex, thanks for the quick reply and suggestion.
I tried the wide format, but it didn't report the changes as I wanted. The problem is that if I want only the changed values and not the equal values, then each row would have a different set of columns to show (and potentially multiple columns). I was aiming to have a "change table" where each row refers to a single element in the data frame which was changed from the old to new version. That is why I am calling it "sparse" -- since it is just picking out various elements of interest from the full data frame. I can imagine a large table with a couple changes per row, and this sparse format would be much more concise.
For context to my process, the next step in my data reconciliation process is to save this "change table" to Excel and then add a new column called "comment" and then the user who is responsible for the data can document why they made each specific change from the prior version so that we can track all changes and the reasons. Using a wide format, there would be very many unchanged values and I wouldn't be able to add comments to each change unless I added a "comment" column for each of the original columns (e.g. var2_comment, var1_comment, val3_comment, etc). In a wide format, each row could have many changes (or zero changes) across many columns and the next row might have changes in different columns. Additionally, in a wide format, I would still have to manually examine a large table to compare the "old" and "new" in order to locate the specific values across the whole data frame which were changed.
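The hand-off step described above could look roughly like the following sketch. The change table here is made up for illustration, with column names borrowed from the proposal in this issue, and the file is written as CSV (which Excel opens) so that only base R is needed; a dedicated Excel writer package could be swapped in.

```r
# Made-up change table; the values are placeholders, not real comparison output.
changes <- data.frame(
  id                 = c("a", "b"),
  column_with_change = c("score", "grade"),
  old_value          = c("20", "Y"),
  new_value          = c("25", "Z"),
  stringsAsFactors = FALSE
)

# Add an empty column for the data owner to document each change.
changes$comment <- ""

# Write to a temporary CSV file; replace out_file with a real path as needed.
out_file <- tempfile(fileext = ".csv")
write.csv(changes, out_file, row.names = FALSE)
```

Because each row is exactly one change, a single comment column is enough, which is the point made above about the wide format.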
I hope this is making some sense. My new function is certainly useful for my use case and I appreciate this package since it quickly finds all the changes. But I'm not sure how many people would want this type of view and if it is worth adding to your code.
I think I understand your use case. I feel like this can be a simple data transformation function rather than an API capability.
library(compareDF)
library(dplyr)
# Load the example data sets that ship with compareDF
data("results_2010", "results_2011")
# Reshape each year to long format; integer columns are cast to character first
results_2010_long = results_2010 %>%
  mutate_if(is.integer, as.character) %>%
  tidyr::pivot_longer(c(-Division, -Student), names_to = "Subject")
results_2011_long = results_2011 %>%
  mutate_if(is.integer, as.character) %>%
  tidyr::pivot_longer(c(-Division, -Student), names_to = "Subject")
# Compare the long tables, then spread the change markers into one column each
ctable = compare_df(results_2010_long, results_2011_long, c("Division", "Student"))
ctable$comparison_df %>%
  tidyr::pivot_wider(id_cols = c(grp, Student, Subject), names_from = chng_type)
Result:
     grp Student Subject    `+`   `-`
   <int> <chr>   <chr>      <chr> <chr>
 1     1 Akshay  Discipline B     A
 2     2 Ananth  Maths      99    78
 3     3 Isaac   Discipline B     A
 4     4 Rohit   Maths      95    94
 5     4 Rohit   Discipline C     D
 6     5 Venu    Maths      99    100
 7     6 Vishwas Maths      93    82
 8     6 Vishwas Discipline A     B
 9     7 Bulla   Maths      84    97
10     8 DIkChik Maths      NA    91
Yes! That would work well - that's clever.
The only problem I see with that is that none of your nice tolerance options for ignoring numeric values near each other would work since all the data needs to be forced to character in the long format if any of the columns are character. I suppose you could then try to run compare_df twice, once for numeric columns and once for character columns.... but then we're starting to lose the convenience value of this package.
It seems fair to close this issue if you'd like. If that's your decision, then I might go ahead and put my function in a separate plain repo on my GitHub in case people want to use it for convenience.
> I suppose you could then try to run compare_df twice, once for numeric columns and once for character columns
Yes, that's what I would do. Your case, while very relevant, is quite specific to your workflow. I will keep this in mind and bring it into the API if I get a few more requests like this.
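To illustrate the two-pass idea in isolation, here is a rough base R sketch that does not call compare_df at all. It assumes both data frames hold the same rows in the same order; the helper name `changed_cells`, the toy data, and the 0.01 threshold are all made up for this example. With compareDF itself, the equivalent would presumably be one compare_df call per column subset, with the tolerance options applied only to the numeric subset.

```r
# Hypothetical helper, not part of compareDF: report one row per changed cell
# for two data frames that hold the same rows in the same order.
changed_cells <- function(old, new, key, cols, differs) {
  out <- lapply(cols, function(col) {
    idx <- which(differs(old[[col]], new[[col]]))
    data.frame(
      key       = as.character(old[[key]][idx]),
      column    = rep(col, length(idx)),
      old_value = as.character(old[[col]][idx]),
      new_value = as.character(new[[col]][idx]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, out)
}

# Toy data, made up for illustration.
old <- data.frame(id = c("a", "b"), score = c(10, 20), grade = c("X", "Y"),
                  stringsAsFactors = FALSE)
new <- data.frame(id = c("a", "b"), score = c(10.001, 25), grade = c("X", "Z"),
                  stringsAsFactors = FALSE)

# Split the value columns by type.
value_cols   <- setdiff(names(old), "id")
numeric_cols <- value_cols[vapply(old[value_cols], is.numeric, logical(1))]
other_cols   <- setdiff(value_cols, numeric_cols)

# Pass 1: numeric columns, ignoring absolute differences up to 0.01.
num_changes <- changed_cells(old, new, "id", numeric_cols,
                             function(x, y) abs(x - y) > 0.01)
# Pass 2: remaining columns, compared exactly as text.
chr_changes <- changed_cells(old, new, "id", other_cols,
                             function(x, y) as.character(x) != as.character(y))

# Values are converted to character only after comparison, so the results stack.
all_changes <- rbind(num_changes, chr_changes)
print(all_changes)
```

The key design point is that the numeric comparison happens before any conversion to character, so the tolerance still applies even though the combined table stores values as text.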
> It seems fair to close this issue if you'd like. If that's your decision, then I might go ahead and put my function in a separate plain repo on my GitHub in case people want to use it for convenience.
Absolutely! That's the spirit of FOSS. I am sure other people will be able to use this as a convenience function.
Sounds good! Here's my function in case anyone wants to use it:
Is your feature request related to a problem? Please describe. I was using this nice package to compare two data frames, but my DFs were about 60 columns wide and 250 rows long. I wanted an easy way to see all the changes across the data frames without having to look carefully for differently colored text in the very wide HTML output.
Describe the solution you'd like The desired output format would be "sparse" -- which means that each change of a value would be a row in the output table. So then I could see all the changes. The column names of the output table could be: id/group, change_type, column_with_change, old_value, new_value.
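A rough base R sketch of such a sparse table follows. It is not the function from the linked branch: the helper name `sparse_changes` and the toy data are made up, it assumes a single key column with unique values, and it only reports cells that differ in rows present in both data frames (added or removed rows are ignored, so change_type is always "changed" here).

```r
# Hypothetical helper (not the function from the linked branch): build a
# one-row-per-changed-cell table for two data frames sharing a unique key column.
sparse_changes <- function(old, new, key) {
  value_cols <- setdiff(intersect(names(old), names(new)), key)
  shared_ids <- intersect(old[[key]], new[[key]])
  # Align both data frames on the keys they have in common.
  old_m <- old[match(shared_ids, old[[key]]), , drop = FALSE]
  new_m <- new[match(shared_ids, new[[key]]), , drop = FALSE]

  pieces <- lapply(value_cols, function(col) {
    o <- as.character(old_m[[col]])
    n <- as.character(new_m[[col]])
    # A cell changed if exactly one side is NA, or both are present and differ.
    changed <- which((is.na(o) != is.na(n)) | (!is.na(o) & !is.na(n) & o != n))
    data.frame(
      id                 = as.character(shared_ids[changed]),
      change_type        = rep("changed", length(changed)),
      column_with_change = rep(col, length(changed)),
      old_value          = o[changed],
      new_value          = n[changed],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

# Toy data, made up for illustration.
old <- data.frame(id = c("a", "b", "c"), x = c(1, 2, 3), y = c("p", "q", "r"),
                  stringsAsFactors = FALSE)
new <- data.frame(id = c("a", "b", "c"), x = c(1, 5, 3), y = c("p", "q", "s"),
                  stringsAsFactors = FALSE)

res <- sparse_changes(old, new, "id")
print(res)
```

Because values are compared as text, this sketch has the limitation discussed elsewhere in this thread: numeric tolerance cannot be applied without a separate numeric pass.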
Describe alternatives you've considered I did not look for other R packages after finding this one.
Additional context I have written the code and put it into a branch of a forked repository. Please review and provide feedback if you'd like for me to initiate a pull request: https://github.com/bandrewfox/compareDF/blob/create_sparse_output/R/fnsOutputs.R#L160
Here is example output using the example objects in this package:
(The comma-separated values in new_value appear because Rohit from Division B got a new row; since "Student" was the only key, the new values had to be combined.)
And if you need two columns to uniquely identify a student, then my code also supports that as follows:
This case works better since then it is clear that Rohit,B was the new row and then Rohit,A had some specific changes.
(I had thought also to separate those values with a colon instead of a comma. I could add a "sep" parameter).