Triamus / play

play repo for experiments (mainly with git)
1 stars 0 forks source link

data quality readme #19

Open Triamus opened 6 years ago

Triamus commented 6 years ago

title: "R Notebook" output: html_notebook editor_options: chunk_output_type: console

Description

checkthat - A data quality framework in R


Content


Overview

The overall idea of this package is to introduce a data quality framework which enables users to define a set of rules that any data input should adhere to. We will focus on tabular data within data frames to start with. The final package will probably be a set of functions that allow the user to detemine the number and potentially the materiality of any rule violation. The exact output variations need to be clarified.


Framework design

Terminology

Data

When we refer to data, we usually mean a tabular (i.e. data frame) or columnar (i.e. vector) construct.

Rule

When we refer to rule or ruleset, we usually mean a set of expressions to be evaluated against some data. We may use rule, expression or condition interchangeably but stick to the term expression when it concerns the programming paradigm.

Measure

When we refer to measure, we usually mean a variable to quantify the extent of rule violations. This will most often be a simple count or a defined materiality measure that resides in the measured data.

Measurement item

When we refer to measurement item, we usually mean the abstract notion of a rule in combination with a key/id independent of the data it may be applied to.

Measurement

When we refer to measurement, we usually mean the concrete quantification/materialization of a measurement item with respect to the rule it is based on, the key/id it is mapped to and the data it is evaluated against. In addition a measurement may be connected to a timestamp to differentiate against the same measurement performed in a different point in time.

User

When we refer to user or you, we mean a person or group of persons of any gender.

Design ideas

Overall concept

A user provides a set of rules in a defined rule language that can be parsed by functions to evaluate against a given dataset. The user receives an output that indicates the data (variable) quality in terms of some meaningful measure (e.g. count). It is to be decided whether the output is only returning an aggregated result, i.e. an aggregated measure per measured variable, or also the full evaluation result on row-basis.

Data

Theoretically, the data could be of any shape and any type but for all practical purposes we focus on tabular data that resides in a data frame.

Rules

We can think of rules from two perspectives, first the type of rule that is applied and second the data dimension it is applied to.

Extensions

One can think of a few extensions to the basic design, in particular


Some initial implementation ideas

Working with tidyeval and expressions in the tidyverse

To parse expressions programmatically we could make use of the tidy evaluation approach within the tidyverse and in particular functions in dplyr, purrr, and rlang. If you are familiar with the SAS Macro Language and in particular its quoting rules, you may be reminded of it in some instances although rlang is ultimately quite different. For further resources around tidy evaluation and rlang, see

We make up some minimal examples to showcase the overarching ideas.

library(tidyverse)
library(rlang)

packageVersion("dplyr")
packageVersion("rlang")

data(mtcars)

In base R we may evaluatie some boolean expression as so.

# base R plain boolean
mtcars[,"cyl"] > 4
mtcars$cyl > 4

In rlang we can work with quosures to quote our expression and leave it unevaluated until explicitly called. Note that quosures are environment-aware. We can evaluate via rlang::eval_tidy().

# rlang with tidy evaluation
cyl_larger_4 <- quo(mtcars$cyl > 4)
cyl_larger_4
class(cyl_larger_4)
typeof(cyl_larger_4)
rlang::eval_tidy(cyl_larger_4)

# without providing data in condition
cyl_larger_4_b <- rlang::quo("cyl" > 4)
cyl_larger_4_b
# this selects all filtered data
mtcars[rlang::eval_tidy(cyl_larger_4_b),]

Ultimately, we like to build functions around our calls so let's work our way towards a first working example.

# select one variable
temp_fn <- function(x, var){
  print(rlang::enquo(var))
  var_new <- rlang::enquo(var) 
  x %>% 
    select(., rlang::UQ(var_new))
  }
temp_fn(mtcars, cyl)

# filter one variable
temp_fn <- function(x, condition){
  condition_quoted <- rlang::enquo(condition)
  x %>% 
    dplyr::filter(., rlang::eval_tidy(rlang::UQ(condition_quoted)))
  }
temp_fn(mtcars, (cyl > 4))

# transmutate one variable
temp_fn <- function(x, condition){
  condition_quoted <- rlang::enquo(condition)
  x %>% 
    dplyr::transmute(., test = rlang::eval_tidy(rlang::UQ(condition_quoted)))
  }
temp_fn(mtcars, (cyl > 4))

# transmute one variable and change output name
temp_fn <- function(x, condition, condition_name){
  condition_quoted <- rlang::enquo(condition)
  x %>% 
    dplyr::transmute(., rlang::UQ(condition_name) :=
                       rlang::eval_tidy(rlang::UQ(condition_quoted)))
  }
temp_fn(mtcars, (cyl > 4), "cyl_larger_four")

In the end we ideally can provide key-value pairs of named (keyed) expressions. The following design has been adapted from a discussion around Passing named list to mutate and probably other dplyr verbs.

# multiple arguments with names
temp_fn <- function(x, args) {
  mutate(x, rlang::UQS(args))
  }

temp_fn(
  mtcars, 
  args = rlang::quos(cyl_larger_4 = cyl > 4, 
                     mpg_smaller_mean_mpg = mpg < mean(mpg)))

# also works in a dplyr chain
mtcars %>%
  temp_fn(rlang::quos(
    cyl_larger_4 = cyl > 4, 
    mpg_smaller_mean_mpg = mpg < mean(mpg)))

# we can also only keep the outcome
temp_fn <- function(x, args) {
  transmute(x, rlang::UQS(args))
  }

mtcars %>%
  temp_fn(rlang::quos(
    cyl_larger_4 = cyl > 4, 
    mpg_smaller_mean_mpg = mpg < mean(mpg)))

We could provide a materiality measure by introducing it as another expression via multiplying the boolean result with respective measure.

mtcars %>%
  temp_fn(rlang::quos(
    cyl_larger_4 = cyl > 4,
    cyl_larger_4_mat = (cyl > 4) * mpg,
    mpg_smaller_mean_mpg = mpg < mean(mpg)))

Finally, we may only be interested in the overall result rather than row-wise outcomes.

mtcars %>%
  temp_fn(rlang::quos(
    cyl_larger_4 = cyl > 4,
    cyl_larger_4_mat = (cyl > 4) * mpg,
    mpg_smaller_mean_mpg = mpg < mean(mpg))) %>%
  summarise(cyl_larger_4 = sum(cyl_larger_4, na.rm = TRUE),
            cyl_larger_4_mat = sum(cyl_larger_4_mat, na.rm = TRUE),
            mpg_smaller_mean_mpg = sum(mpg_smaller_mean_mpg, na.rm = TRUE))
# randomly creating missing values for different columns with pre-defined missing
# proportion; function taken from:
# https://stackoverflow.com/questions/39513837/add-exact-proportion-of-random-missing-values-to-data-frame
create_na <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
  }

mtcars_adj <-
  mtcars %>%
  create_na(., pctNA = 0.1)

library(wakefield)
# https://github.com/trinker/wakefield

# age   dice hair   military sex_inclusive animal   dna height  month   smokes
# answer    dob income  name    speed area  dummy   internet_browser normal state
# car   education   iq  political   string children employment language race    upper 
# coin  eye level   religion    valid color grade   likert  sat year date_stamp 
# grade_level   lorem_ipsum sentence    zip_code death  group   marital sex

set.seed(5000)

temp_data <-
  wakefield::r_data_frame(
    n = 500,
    id,
    race,
    age,
    sex,
    date_stamp,
    iq,
    height,
    died,
    string,
    grade, grade, grade
    ) %>%
  # randomly creating missing values for different columns with pre-defined missing
  # proportion
  wakefield::r_na(prob = 0.1) %>% 
  rename_all(tolower)

format(utils::object.size(temp_data), units = "Mb")

# we could even plot variable types and missing values via plot() method
temp_data %>% plot()

rules <-
  rlang::quos(
    iq_larger_100 = (iq > 100),
    iq_larger_100_mat = (iq > 100) * height,
    age_missing = (is.na(age)),
    grade_1_smaller_mean_grade_1 = (grade_1 < mean(grade_1, na.rm = TRUE)),
    grade_1_larger_grade_2 = (grade_1 > grade_2),
    race_value_range = !(race %in% c("White", "Hispanic", "Native", NA, "Black", 
                                    "Asian", "Bi-Racial", "Other")),
    race_length = (stringr::str_length(race) > 8),
    string_contains_numeric = (stringr::str_detect(string, "[[:digit:]]"))
    )

results <-
  temp_data %>%
  temp_fn(rules)
results