data-cleaning / validate

Professional data validation for the R environment

Compare values of two columns based on complex conditions #176

Open josmos opened 1 year ago

josmos commented 1 year ago

I have a rather complex function comparing (possibly partial) date strings:

library(stringr)  # provides str_starts() and fixed()

# Compares one pair of (possibly partial) date strings; returns TRUE if date1 <= date2
# or if no comparison is possible. Expects scalar (length-one) inputs.
compare_partial_dates <- function(date1, date2, missing_value_pattern = "nk", sep = ".") {
  # fixed() keeps "." from being treated as a regex wildcard
  no_y_pat <- fixed(paste(missing_value_pattern, missing_value_pattern, missing_value_pattern, sep = sep)) # nk.nk.nk
  no_m_pat <- fixed(paste(missing_value_pattern, missing_value_pattern, "", sep = sep)) # nk.nk.%Y
  no_d_pat <- fixed(paste(missing_value_pattern, "", sep = sep)) # nk.%m.%Y
  if (is.na(date1) || is.na(date2)) {
    # missing date: no comparison possible
    return(TRUE)
  } else if (str_starts(date1, no_y_pat) || str_starts(date2, no_y_pat)) {
    # nk.nk.nk: no comparison possible
    return(TRUE)
  } else if (str_starts(date1, no_m_pat) || str_starts(date2, no_m_pat)) {
    # missing month: set both dates to 01.01.%Y
    date1 <- paste("01", "01", substr(date1, nchar(date1) - 3, nchar(date1)), sep = ".")
    date2 <- paste("01", "01", substr(date2, nchar(date2) - 3, nchar(date2)), sep = ".")
  } else if (str_starts(date1, no_d_pat) || str_starts(date2, no_d_pat)) {
    # missing day: set both dates to 01.%m.%Y
    date1 <- paste("01", substr(date1, nchar(date1) - 6, nchar(date1)), sep = ".")
    date2 <- paste("01", substr(date2, nchar(date2) - 6, nchar(date2)), sep = ".")
  }
  # convert to Date
  date1 <- as.Date(strptime(date1, format = "%d.%m.%Y", tz = "UTC"))
  date2 <- as.Date(strptime(date2, format = "%d.%m.%Y", tz = "UTC"))

  # compare the date values
  return(date1 <= date2)
}

I have a lot of date columns to compare. Writing a rule with simple expressions for each column combination would be a mess. Is it possible to do this comparison in validate with a function like this one (or a similar one)? How could this be implemented?

markvanderloo commented 1 year ago

Hi there. For any function f(...) that returns a logical vector, you can create a rule like this:

rules <- validator( f(x,z) == TRUE)

If you need to compare, say, variables x and y to z, then you could use a variable group like so:

rules <- validator(
  G := var_group(x,y)
, f(G,z)
)
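
Since a rule is evaluated against whole columns, f has to work element-wise and return one logical value per row. The function in the question handles a single pair of strings at a time (its `if` and `||` conditions expect length-one input), so it would need a wrapper first. A minimal sketch of such a wrapper, assuming a hypothetical scalar check `leq_or_missing` as a stand-in for compare_partial_dates and made-up data; the validate part only runs if the package is installed:

```r
# Hypothetical scalar check, standing in for compare_partial_dates() from the question.
leq_or_missing <- function(a, b) {
  if (is.na(a) || is.na(b)) {
    return(TRUE)  # missing value: no comparison possible
  }
  a <= b
}

# Element-wise wrapper: applies the scalar check to each pair of values
# and returns a plain logical vector of the same length as the inputs.
f <- function(a, b) {
  as.logical(mapply(leq_or_missing, a, b, USE.NAMES = FALSE))
}

# Made-up data: x and y are each compared to z.
dat <- data.frame(x = c(1, 5, NA), y = c(2, 2, 4), z = c(3, 3, 3))

print(f(dat$x, dat$z))  # one logical value per row

# Use the wrapper in rules (skipped when validate is not installed).
if (requireNamespace("validate", quietly = TRUE)) {
  library(validate)
  rules <- validator(f(x, z) == TRUE, f(y, z) == TRUE)
  print(summary(confront(dat, rules)))
}
```

`mapply` is used here because it passes the values pair by pair; `Vectorize(leq_or_missing)` would be another way to get the same effect.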

markvanderloo commented 1 year ago

The other option is to generate the rules in a file and read them later.

template <- "f(%s,z)"
txt <- paste(sprintf(template, some_vector_of_names), collapse="\n")
write(txt, file="rules.R")
rules <- validator(.file="rules.R")
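
To make the generated text concrete, here is a small sketch with hypothetical column names (`start_date`, `birth_date`) filled in for some_vector_of_names. It uses the `== TRUE` form from the first rule in this thread and writes to a temporary file rather than the working directory; the last step only runs if validate is installed.

```r
# Hypothetical column names; in practice, the date columns to be checked against z.
some_vector_of_names <- c("start_date", "birth_date")

# One rule per column name, joined into a single newline-separated string.
template <- "f(%s, z) == TRUE"
txt <- paste(sprintf(template, some_vector_of_names), collapse = "\n")
writeLines(txt)
# f(start_date, z) == TRUE
# f(birth_date, z) == TRUE

# Write the rules to a temporary file.
rules_file <- file.path(tempdir(), "rules.R")
writeLines(txt, rules_file)

# Read the rules back in (skipped when validate is not installed).
if (requireNamespace("validate", quietly = TRUE)) {
  rules <- validate::validator(.file = rules_file)
  print(rules)
}
```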

akuhnle commented 1 year ago

I have a similar issue. In a previous version I was able to use the infix operator A %==% B within rules, but this no longer seems to be the case. Do I have to rewrite all rules that used this operator to something like eq(A,B) == TRUE?


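`%==%` <- function(e1, e2) {
  # NA-safe equality: two NAs count as equal, NA vs. a value counts as not equal
  if (length(e1) == length(e2)) {
    isEqual <- (e1 == e2) | (is.na(e1) & is.na(e2))
    isEqual[is.na(isEqual)] <- FALSE
    return(isEqual)
  } else {
    # lengths differ: a single FALSE is returned
    return(FALSE)
  }
}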

Thanks