data-cleaning / errorlocate

Find and replace erroneous fields in data using validation rules
http://data-cleaning.github.io/errorlocate/

Allow `log()` in linear statements / expand linear checks #28

Closed edwindj closed 3 years ago

edwindj commented 3 years ago

A useful addition to the allowed syntax would be linear rules including log transforms.

Currently, errorlocate ignores checks such as the following because they are non-linear:

# non-linear check, so ignored by errorlocate
total_salary >= n_employees * min_salary

Taking the log of both sides turns this into a statement that is linear in the log-transformed variables (provided all values are positive):

# currently ignored by errorlocate because of `log` inside the linear statement
log(total_salary) >= log(n_employees) + log(min_salary)
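
Because `log` is strictly increasing on positive numbers, the two forms accept and reject the same records. A minimal base R sketch of that equivalence follows; the sample values are made up for illustration and are not taken from errorlocate:

```r
# Hypothetical sample records; all values are strictly positive
d <- data.frame(
  total_salary = c(10000, 50000, 30000),
  n_employees  = c(1, 4, 3),
  min_salary   = c(11000, 11000, 9000)
)

# Original non-linear check
nonlinear <- with(d, total_salary >= n_employees * min_salary)

# Log-transformed check: linear in log(total_salary), log(n_employees), log(min_salary)
loglinear <- with(d, log(total_salary) >= log(n_employees) + log(min_salary))

# Both forms flag the same records
identical(nonlinear, loglinear)
```

The equivalence only holds for strictly positive values: `log(0)` is `-Inf` and the log of a negative number is `NaN`, so such records would need separate handling.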

This would require some refactoring, because extra data columns holding the log transforms must be derived first, analogous to how `lm` first derives a model matrix. A variable and its log transform should share the same error indicator, so that violating a rule that uses either the variable or its log transform marks the variable itself as faulty.
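
The idea can be sketched in base R as follows. This is only an illustration of the approach described above, not the errorlocate implementation: the helpers `add_log_columns()` and `source_variable()` and the `._log` column suffix are hypothetical names chosen for this example.

```r
# Hypothetical helper: derive one extra column per log-transformed variable
add_log_columns <- function(data, vars, suffix = "._log") {
  for (v in vars) {
    data[[paste0(v, suffix)]] <- log(data[[v]])
  }
  data
}

# Hypothetical helper: map a derived column name back to its source variable
source_variable <- function(columns, suffix = "._log") {
  sub(paste0(suffix, "$"), "", columns, fixed = FALSE)
}

records <- data.frame(total_salary = 10000, n_employees = 1, min_salary = 11000)
derived <- add_log_columns(records, names(records))

# A variable and its log column resolve to the same error indicator,
# so there is one indicator per original variable
indicators <- unique(source_variable(names(derived)))
indicators
```

With this mapping, a violated rule that mentions only the derived log column would still be attributed to the original variable.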

Note that this could be extended to other monotonically increasing functions.

edwindj commented 3 years ago

This is implemented in the current GitHub repo and can be activated with:

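library(errorlocate)

# allow log-transformed terms in linear rules
options(errorlocate.allow_log=TRUE)

rules <- validator(log(total_salary) >= log(n_employees) + log(min_salary))
# this record violates the rule: 10000 < 1 * 11000
data <- data.frame(total_salary = 10000, n_employees = 1, min_salary = 11000)
# total_salary has the lowest weight of the three variables
weights <- c(total_salary = 1, n_employees = 2, min_salary = 2)
locate_errors(data, rules, weights)$errors

# inspect and solve the underlying mixed-integer problem for the same record
mip <- inspect_mip(data, rules, weights)
mip$execute()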