alan-turing-institute / datadiff

Datadiff is diff for data
MIT License

Overview

Tabular data sets are common, and many data processing tasks must be repeated on multiple similar data samples. In practice, however, there may be unexpected changes in structure across different batches of data, which are likely to break the analytical pipeline.

Datadiff identifies structural differences between pairs of (related) tabular data sets and returns an executable summary (or "patch") which is both a description of the differences and a corrective transformation.
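To make the idea of an executable, composable patch concrete, here is a minimal base-R sketch. It is a conceptual illustration only and does not show datadiff's own implementation: the helpers `make_permute_patch`, `make_delete_patch` and `compose_patches`, and the example data frames, are hypothetical names introduced for this example.

```r
# Conceptual sketch only: hand-written "patches" as plain R functions.
# These helpers are hypothetical and are not part of datadiff's API.

# A patch that reorders columns according to an index permutation.
make_permute_patch <- function(perm) {
  function(df) df[, perm, drop = FALSE]
}

# A patch that drops one column by position.
make_delete_patch <- function(index) {
  function(df) df[, -index, drop = FALSE]
}

# Compose two patches: apply `first`, then `second`.
compose_patches <- function(first, second) {
  function(df) second(first(df))
}

# Two batches of "the same" data whose structure has drifted.
batch1 <- data.frame(id = 1:3,
                     height = c(1.6, 1.7, 1.8),
                     city = c("A", "B", "C"))
batch2 <- data.frame(city = c("D", "E"), id = 4:5)

# The patch both describes the difference (drop column 2, then swap the
# remaining columns) and performs the corrective transformation.
patch <- compose_patches(make_delete_patch(2), make_permute_patch(c(2, 1)))
patched <- patch(batch1)
names(patched)
```

After patching, `batch1` has the same column layout as `batch2`, so a pipeline written for one batch can run on the other.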

In making comparisons, datadiff considers the following (composable) patch types:

Installation

Datadiff is implemented in R and can be built from source or installed using the devtools package as follows.

# Install the most recent release from GitHub:
# install.packages("devtools")
devtools::install_github("alan-turing-institute/datadiff")

Usage

Diff two data frames with ddiff(df1, df2).
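As a rough sketch of a typical call, the example below builds two synthetic data frames whose columns hold similar data in a different order and passes them to `ddiff`. The data frames are invented for illustration. Applying the result by calling it as a function is an assumption based on the description of the patch as "executable"; check the vignette for the exact interface.

```r
set.seed(1)

# Two batches drawn from the same distributions, with columns reordered.
df1 <- data.frame(a = rnorm(100),
                  b = runif(100, min = 10, max = 20),
                  c = rexp(100))
df2 <- data.frame(c = rexp(100),
                  a = rnorm(100),
                  b = runif(100, min = 10, max = 20))

# Only run the comparison if datadiff is installed.
if (requireNamespace("datadiff", quietly = TRUE)) {
  patch <- datadiff::ddiff(df1, df2)
  print(patch)

  # Assumption: the returned patch can be applied by calling it on df1.
  patched <- patch(df1)
  str(patched)
}
```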

For more information and examples, see the package vignette:

# Build the vignette on package installation:
devtools::install_github("alan-turing-institute/datadiff", build_vignettes = TRUE)
vignette("datadiff")