data-cleaning / validate

Professional data validation for the R environment

Something like mapply() for validation rules #108

Open matthiasgomolka opened 4 years ago

matthiasgomolka commented 4 years ago

Hi Mark, I am in the situation that I want to check, for many columns, whether they contain only values from their respective codelists. As far as I know, there is no shortcut for writing these kinds of rules, since using more than one var_group() results in the Cartesian product of these groups.

So what I would find helpful is the following:

  1. I have a list of variables and
  2. a list of codelists of the same length.

Within the definition of a validation rule, I would like to use something like (pseudo-code):

mapply(function(var, codelist) {var %in% codelist},
       var = var_group(var_A, var_B, var_C),
       codelist = list(cl_A, cl_B, cl_C)
)

So this should map over var and codelist in parallel and thus create only three validation rules (one per variable/codelist pair) when fed into validator(), rather than the nine rules of the Cartesian product.

To make this even clearer, maybe have a look at how map() is used as a transformation within the {drake} package (https://books.ropensci.org/drake/static.html#map). This deviates from the pseudo-code above, but it might be a better way to actually implement this (I have no idea).

What are your thoughts on this?

markvanderloo commented 4 years ago

Hi Matthias, I think we should support something for this. One thing you can do is externalize the code lists as follows:

library(validate)

dat <- data.frame(
    x = c("a","a","v","c","b")
  , y = c("321","321","123","231","444")
)

codelists <- list(
    foo = c("a","b","c")
  , bar = c("123","231","312","213","132","321") 
)

rules <- validator(
    x %in% foo
  , y %in% bar
)

out <- confront(dat, rules, ref=codelists)
summary(out)
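For many columns, the rules themselves can also be generated rather than typed out. The following is only a sketch of one possible workaround, not a built-in mapply()-style feature: it pairs each variable name with the name of its code list, builds one rule string per pair, and passes the result to validator() through its .data argument. The vectors vars and lists and the data frame rule_df are hypothetical names introduced for illustration.

```r
library(validate)

dat <- data.frame(
    x = c("a","a","v","c","b")
  , y = c("321","321","123","231","444")
)

codelists <- list(
    foo = c("a","b","c")
  , bar = c("123","231","312","213","132","321")
)

# Hypothetical pairing: the i-th variable is checked against the i-th code list.
vars  <- c("x", "y")
lists <- c("foo", "bar")

# One rule string per pair, e.g. "x %in% foo" ("%%" is a literal "%" in sprintf).
rule_df <- data.frame(
    rule        = sprintf("%s %%in%% %s", vars, lists)
  , name        = paste0(vars, "_in_", lists)
  , description = ""
  , stringsAsFactors = FALSE
)

# Read the rules from the data frame; the code lists are still passed via 'ref'.
rules <- validator(.data = rule_df)

out <- confront(dat, rules, ref = codelists)
summary(out)
```

This gives one rule per variable/code-list pair (two here), so the number of rules grows with the number of pairs and not with the Cartesian product of the two sets.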