edwindj / daff

Diff, patch and merge for data.frames, see http://paulfitz.github.io/daff/
https://edwindj.github.io/daff/

feature request: summary #12

Closed: jsta closed this issue 7 years ago

jsta commented 7 years ago

It would be amazing if there were a summary method for diff_data objects that provided a tally of differences in terms of cells, columns, and rows (possibly with a % difference).

# from README
library(daff)
y <- iris[1:3,]
x <- y

x <- head(x,2) # remove a row
x[1,1] <- 10 # change a value
x$hello <- "world"  # add a column
x$Species <- NULL # remove a column

# modified
x <- rbind(x, c(3, 3, 3, 3, "test"))
x <- x[-2,]

patch <- diff_data(y, x)

changes <- length(grep("->", unlist(patch$get_data())))
col_added <- length(which(names(patch$get_data()) == "+++"))
col_rmd <- length(which(names(patch$get_data()) == "---"))
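
For comparison, the structural part of that tally (columns and rows added or removed) can be sketched in base R without touching the diff object at all. This is only an illustration: `tally_structure` is a hypothetical helper, not part of daff, and it matches columns purely by name.

```r
# Hypothetical helper (not part of daff): tally structural differences
# between two data frames using base R only. Columns are matched by name.
tally_structure <- function(old, new) {
  list(
    cols_added   = length(setdiff(names(new), names(old))),
    cols_removed = length(setdiff(names(old), names(new))),
    rows_old     = nrow(old),
    rows_new     = nrow(new)
  )
}

y <- iris[1:3, ]
x <- head(y, 2)        # remove a row
x$hello <- "world"     # add a column
x$Species <- NULL      # remove a column

str(tally_structure(y, x))
```

It does not detect changed cell values, which is where the diff object is still needed.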

edwindj commented 7 years ago

Thanks for your feature request (nice one!). I'm a bit tied up this week and will look into it next week.

edwindj commented 7 years ago

I did a first implementation of the summary method:

library(daff)
x <- iris
x[1,1] <- 10
dd <- diff_data(x, iris)
summary(dd)

dd_sum <- summary(dd)
unclass(dd_sum)

Any further suggestions?

jsta commented 7 years ago

Very nice! That's close to what I came up with using compareDF:

diff_csv <- function(original_csv, hand_edit_csv){
  orig_csv  <- read.csv(original_csv, stringsAsFactors = FALSE)
  hedit_csv <- read.csv(hand_edit_csv, stringsAsFactors = FALSE)

  res <- compareDF::compare_df(hedit_csv, orig_csv, c("pagenum"))

  rows_changed <- res$change_summary[3]
  cell_changes <- length(grep("\\+",
                   unlist(res$comparison_table_diff[,3:ncol(res$comparison_table_diff)])))
  percent_diff <- round(cell_changes / 
                   length(unlist(res$comparison_table_diff[,3:ncol(res$comparison_table_diff)])) * 100, 2)

  paste0(cell_changes, " cells changed; ",
         rows_changed, " rows changed; ",
         percent_diff, "% percent difference")
}

[1] "789 cells changed; 159 rows changed; 7.45% percent difference"
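
The percent figure is just changed cells divided by the total number of cells compared. A minimal base-R sketch of that arithmetic follows. It assumes both data frames have the same dimensions and column order, and `percent_cells_changed` is a hypothetical name used here for illustration, not a function from compareDF or daff.

```r
# Hypothetical helper: percentage of cells that differ between two data
# frames. Assumes identical dimensions and column order.
percent_cells_changed <- function(a, b) {
  stopifnot(identical(dim(a), dim(b)))
  # Compare as character so factor and numeric columns are handled alike.
  a_chr <- vapply(a, as.character, character(nrow(a)))
  b_chr <- vapply(b, as.character, character(nrow(b)))
  # Comparisons involving NA are dropped from the count of changed cells.
  changed <- sum(a_chr != b_chr, na.rm = TRUE)
  round(changed / length(a_chr) * 100, 2)
}

a <- iris[1:4, ]       # 4 rows x 5 columns = 20 cells
b <- a
b[1, 1] <- 10          # change one value
b[2, 2] <- 0           # change another value

percent_cells_changed(a, b)
```

Two of the 20 cells differ in this toy example, so the result is 10 percent.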

gwarnes-mdsol commented 7 years ago

Pull request #13 (accepted) modifies 'summary.data_diff' to calculate the number of changed/added/removed rows and columns:

> library(daff)
> iris2 <- cbind(iris, sl.sq=iris$Sepal.Length ^2 , prod.sl.sw=iris$Sepal.Length * iris$Sepal.Width)
> iris2$Petal.Length[14] = 10
> iris2$Petal.Width[22] = 25
> iris2 <- iris2[-10,]
> iris2 <- rbind(iris2, iris2[3:7,])
> dd <- diff_data(iris, iris2)
> summary(dd)

Data diff:
 Comparison: ‘iris’ vs. ‘iris2’ 
        #           Changed Removed Added
Rows    150 --> 154 2       1       5    
Columns 5 --> 7     0       0       1
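
The "#" column of that summary (150 --> 154 rows, 5 --> 7 columns) can be cross-checked with base R alone by repeating the structural steps of the example and comparing dimensions. This sketch deliberately leaves out the two cell edits, since they do not affect the shape.

```r
# Rebuild iris2 with the same structural steps as in the example above.
iris2 <- cbind(iris,
               sl.sq      = iris$Sepal.Length^2,
               prod.sl.sw = iris$Sepal.Length * iris$Sepal.Width)
iris2 <- iris2[-10, ]                 # drop one row:  150 -> 149
iris2 <- rbind(iris2, iris2[3:7, ])   # append 5 rows: 149 -> 154

dim(iris)    # rows and columns before
dim(iris2)   # rows and columns after
```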