Suggestion for a new "sparse" output format

bandrewfox commented 12 months ago

Is your feature request related to a problem? Please describe. I was using this nice package here to compare to data frames, but my DFs were about 60 columns wide and 250 rows. I wanted to have a easy way to see all the changes across the data frames without trying to carefully look for different colored text in the very wide html output.

Describe the solution you'd like The desired output format would be "sparse" -- which means that each change of a value would be a row in the output table. So then I could see all the changes. The column names of the output table could be: id/group, change_type, column_with_change, old_value, new_value.

Describe alternatives you've considered I did not look for other R packages after finding this one.

Additional context I have written the code and put it into a branch of a forked repository. Please review and provide feedback if you'd like for me to initiate a pull request: https://github.com/bandrewfox/compareDF/blob/create_sparse_output/R/fnsOutputs.R#L160

Here is example output using the example objects in this package:

> ctable_student = compare_df(results_2011, results_2010, c("Student"))
> create_sparse_output(ctable_student)

   Student  change_type     column old_value new_value
1   Akshay value_change Discipline         B         A
2   Ananth value_change      Maths        99        78
3    Bulla value_change      Maths        84        97
4    Isaac value_change Discipline         B         A
5     Jojy value_change      Maths        67        99
6    Katti value_change      Maths        90        78
7    Rohit value_change   Division         A       A,B
8    Rohit value_change      Maths        95     94,79
9    Rohit value_change    Physics        92     92,96
10   Rohit value_change Discipline         C       D,C
11   Rohit value_change         PE         B       B,C
12   Rohit value_change        Art        37     37,39
13    Venu value_change      Maths        99       100
14 Vishwas value_change      Maths        93        82
15 Vishwas value_change Discipline         A         B
16 DIkChik    row_added       <NA>      <NA>      <NA>
17   Rohit    row_added       <NA>      <NA>      <NA>
18  Vikram    row_added       <NA>      <NA>      <NA>
19 Dhakkan  row_deleted       <NA>      <NA>      <NA>
20  Mugger  row_deleted       <NA>      <NA>      <NA>

(The comma separated values in new_value are because Rohit from Division B got a new row, but since the keys were only "Student", then the new values need to be combined)

And if you need two columns to uniquely identify a student, then my code also supports that as follows:

> ctable_student_div = compare_df(results_2011, results_2010, c("Student", "Division"))
> create_sparse_output(ctable_student_div, orig_group_cols = c("Student", "Division"))
   grp Student:Division  change_type     column old_value new_value
1    1         Akshay,A value_change Discipline         B         A
2    2         Ananth,A value_change      Maths        99        78
3    3          Bulla,B value_change      Maths        84        97
4    4        DIkChik,B    row_added       <NA>      <NA>      <NA>
5    5        Dhakkan,B  row_deleted       <NA>      <NA>      <NA>
6    6          Isaac,A value_change Discipline         B         A
7    7           Jojy,B value_change      Maths        67        99
8    8          Katti,B value_change      Maths        90        78
9   10         Mugger,B  row_deleted       <NA>      <NA>      <NA>
10  11          Rohit,A value_change      Maths        95        94
11  11          Rohit,A value_change Discipline         C         D
12  12          Rohit,B    row_added       <NA>      <NA>      <NA>
13  13           Venu,A value_change      Maths        99       100
14  14         Vikram,B    row_added       <NA>      <NA>      <NA>
15  15        Vishwas,A value_change      Maths        93        82
16  15        Vishwas,A value_change Discipline         A         B

This case works better since then it is clear that Rohit,B was the new row and then Rohit,A had some specific changes.

(I had thought also to separate those values with a colon instead of a comma. I could add a "sep" parameter).

alexsanjoseph commented 12 months ago

Hi Brian, thanks for the issue. I can see that you are moving away from the "tidy" format into a "wide" format. Have you tried

ctable = compare_df(new_df, old_df, c("var1"))

wide_output = create_wide_output(ctable)
print(wide_output)

You get

grp var2_old var2_new var1_old var1_new val3_old val3_new val2_old val2_new val1_old val1_new

1   2        Y        Y        B        B        2      2.1       B1       B1        2        2
2   3        X        X        C        C        3      4.0       C1       C2        3        3

From this you can use select to choose the columns you are interested

bandrewfox commented 12 months ago

Hi Alex, thanks for the quick reply and suggestion.

I tried the wide format, but it didn't report the changes as I wanted. The problem is that if I want only the changed values and not the equal values, then each row would have a different set of columns to show (and potentially multiple columns). I was aiming to have a "change table" where each row refers to a single element in the data frame which was changed from the old to new version. That is why I am calling it "sparse" -- since it is just picking out various elements of interest from the full data frame. I can imagine a large table with a couple changes per row, and this sparse format would be much more concise.

For context to my process, the next step in my data reconciliation process is to save this "change table" to Excel and then add a new column called "comment" and then the user who is responsible for the data can document why they made each specific change from the prior version so that we can track all changes and the reasons. Using a wide format, there would be very many unchanged values and I wouldn't be able to add comments to each change unless I added a "comment" column for each of the original columns (e.g. var2_comment, var1_comment, val3_comment, etc). In a wide format, each row could have many changes (or zero changes) across many columns and the next row might have changes in different columns. Additionally, in a wide format, I would still have to manually examine a large table to compare the "old" and "new" in order to locate the specific values across the whole data frame which were changed.

I hope this is making some sense. My new function is certainly useful for my use case and I appreciate this package since it quickly finds all the changes. But I'm not sure how many people would want this type of view and if it is worth adding to your code.

alexsanjoseph commented 12 months ago

I think I understand your use case. I feel like this can be a simple data transformation function rather than an API capability.

data("results_2010", "results_2011")
results_2010_long = results_2010 %>% 
  mutate_if(is.integer, as.character) %>% 
  tidyr::pivot_longer(c(-Division, -Student), names_to = "Subject")
results_2011_long = results_2011 %>% 
  mutate_if(is.integer, as.character) %>% 
  tidyr::pivot_longer(c(-Division, -Student), names_to = "Subject")
ctable = compare_df(results_2010_long, results_2011_long, c("Division", "Student"))
ctable$comparison_df %>% tidyr::pivot_wider(id_cols=c(grp, Student, Subject), names_from=chng_type)

Result:

     grp Student Subject    `+`   `-`  
   <int> <chr>   <chr>      <chr> <chr>
 1     1 Akshay  Discipline B     A    
 2     2 Ananth  Maths      99    78   
 3     3 Isaac   Discipline B     A    
 4     4 Rohit   Maths      95    94   
 5     4 Rohit   Discipline C     D    
 6     5 Venu    Maths      99    100  
 7     6 Vishwas Maths      93    82   
 8     6 Vishwas Discipline A     B    
 9     7 Bulla   Maths      84    97   
10     8 DIkChik Maths      NA    91

bandrewfox commented 12 months ago

Yes! That would work well - that's clever.

The only problem I see with that is that none of your nice tolerance options for ignoring numeric values near each other would work since all the data needs to be forced to character in the long format if any of the columns are character. I suppose you could then try to run compare_df twice, once for numeric columns and once for character columns.... but then we're starting to lose the convenience value of this package.

It seems fair to close this issue if you'd like. If that's your decision, then I might go ahead and put my function in a separate plain repo my github in case people want to use it for convenience.

alexsanjoseph commented 12 months ago

I suppose you could then try to run compare_df twice, once for numeric columns and once for character columns

Yes that's what I would do, since I think the case that you have, while very relevant is really a specific case for your use case. I will keep this in mind and bring it into the API if I get a few more requests like this.

It seems fair to close this issue if you'd like. If that's your decision, then I might go ahead and put my function in a separate plain repo my github in case people want to use it for convenience.

Absolutely! That's the spirit of FOSS. I am sure other people might be able to use this as a convenience function

bandrewfox commented 12 months ago

Sounds good! Here's my function in case anyone wants to use it:

https://github.com/bandrewfox/compareDF_sparse_output

alexsanjoseph / compareDF

Suggestion for a new "sparse" output format #53