D3SL commented 4 years ago

It occurred to me while I was comparing a number of very large files that row based output is really only good for eyeballing small amounts of simple data. If the current outputs were offered in column based format a whole lot more could be done with it.

Everything in R is column oriented and when the differences are listed as key old.x new.x old.y new.y that information is much more programmatically accessible. Calculations based on differences, much more flexible output formatting, use outside of R in things like pivot tables, more than 2 items could be compared purr::reduce() style. Modifications would be easier to spot as well in any comparison with a key column. If the key exists in both objects and other columns' values are not identical it's a modification rather than full addition/subtraction.

alexsanjoseph commented 4 years ago

Hi @D3SL -

The decision to give a wide output is intentional, as in makes visual comparison easy first up. That said, the output is also provided in comparison_df object which can be further manipulated in any of the ways that you mentioned for specific problems one may have. Have you tried this approach?

Maybe I'm not following you completely, so can you help me understand this problem with some reprex?

D3SL commented 4 years ago

Sure, I actually did find a package that gets close to this but is heavily limited in other ways. It only shows rows that exist in both objects, and rather than showing each row in its full original form they dedicate one row per variable that's changed, which is imho not as human friendly. Between the two of you there's the perfect data-content comparison package.

alexsanjoseph commented 4 years ago

Thanks for the example, tt's an interesting use case. I have a couple of questions, though

1) Why is this something that's not achieved by a pivot_wider from the compare_df output, depending on the specifics of how the user? 2) How do you deal with cases when one row/group has been replaced by more than one observation?

D3SL commented 4 years ago

Pivoting sort of works. You need to have a single key column to start with, add a marker column to track whether each object is the new or old data, and the tradeoff you'll only have one copy of your key column with both dataframes' worth of rows in it as opposed to a fully side-by-side comparison. I think it's possible to get around that by creating duplicates of the key column in each dataframe but that's an ugly hack. You also lose the chng_type column but that can be programmatically recreated.
That would actually be even more clear in a side-by-side format. The old data only has the original row, the new data has multiple rows. NAs are inserted where needed.

alexsanjoseph commented 4 years ago

What if you do a merge(join) instead. The group column will be the key, and you get all the rows from both sides? Is that the kind of output you're looking at?

D3SL commented 4 years ago

A self full join in the pattern of full_join(comparison_df[comparison_df$key == 'new'], comparison_df[comparison_df$key == 'old'], by='groupcol') is just another way of doing the same thing as pivoting to long then back to wide, it just manages to keep the chng_type column where pivoting loses it. In both cases the user needs to add a redundant column to each dataframe whose only purpose is to track whether a row is from the old or new dataframe, and in both cases the comparison's group column comes out with all rows from both objects, only the remaining columns are in the old.x, new.x format.

In neither case do you get a true side by side comparison with all columns separated by old object and new object like so:

chng_type    x.old     y.old     z.old     x.new     y.new     z.new
    +         NA        NA        NA       1           1         1
    -         1        1        1          NA          NA        NA
    ~         1        1        1          2           2        2

The closest that you can get is this, which still doesn't allow for full programmatic use of the output since you don't have full information. There's only one monolithic x column which contains all rows of x from both df_old and df_new. Again the only workaround I can think of is making an unnecessary clone of the x column in df_old and df_new that isn't used as a join key or group column.

chng_type.old    x     y.old     z.old   chng_type.new   y.new     z.new
    -            1        NA       NA        +            1         1
    +            1        1        1         -            NA        NA
    -            1        1        1         NA           NA        NA
    NA           2        NA       NA        +            2         2

D3SL commented 4 years ago

I tested that workaround idea and found it easiest to do with data.table, mutate with two new columns each containing if_else should also work. I think this method even handles cases where one observation becomes several gracefully since the .old columns will simply have extra rows of NA matching the .new columns.

I'm still uncomfortable with the potential inefficiency of creating two extra copies of the group column to make this work but if someone's riding so close to the edge of their memory limits that this is what pushes them over they should probably be using one of the work-off-disk packages already anyway.

Edit: I never actually looked at the source code, I didn't realise you'd written this in R and not through something like RCPP. I'll see if I can add this myself now that I know everything is written in R.

alexsanjoseph commented 4 years ago

@D3SL - Glad you found a workaround.

Yeah currently the code is not built for performance, I'm not sure if I'll come around to rewriting this in RCpp which would be vastly better. The tests are pretty comprehensive, so if you have expertise in this, it would be great.

Is there anything further or shall I close this?

D3SL commented 4 years ago

I don't know enough C to do something this complicated unfortunately. I do know my way around data.table though, which could still be a significant performance improvement over dplyr as the core. I'd need to figure out what all is actually going on under the hood to refactor like that though.

alexsanjoseph commented 4 years ago

Yeah data.table might be a good compromise. I'll add an enhancement to see if I can do it in data.table. Thanks for the inputs!

alexsanjoseph commented 4 years ago

alexsanjoseph / compareDF

More flexible output, side-by-side as an option #34

35