data-cleaning / validate

Professional data validation for the R environment

Confronting Data with Keys #100

Closed shabsgithub closed 5 years ago

shabsgithub commented 5 years ago

Hi, I am facing an issue when I confront data with a key. I have a bunch of rules in my YAML file, which I validate using the validate package to create summary-level information. I then confront the data with keys so that I can get the exact records that fail. The logic works, but the results seem inconsistent. The field "ABC" shows 10 failed records at the summary level. However, when I confront the data with the same rules and a key, the results for this field are somehow dropped. Here's my code snippet:

v1 <- validator(.file = "few_rules.yaml")
# confront the data
cf <- confront(data, v1)
out <- summary(cf)

name  items  passes  fails  nNA  error  warning
ABC     100      90     10
XYZ     100      95      5

## confront with key
data$key <- paste(data$ABC, data$XYZ, sep = "~")
v <- validator(.file = "few_rules.yaml")
ck <- as.data.frame(confront(data, v, key = "key"))

View(ck)
key  name  value  expression
123  XYZ   False  is.null(XYZ)

Can somebody please shed some light on why it drops the field "ABC"?

markvanderloo commented 5 years ago

Hi there, thanks for taking the time to submit your issue. I have some trouble reproducing it, because I do not have access to your data and rules. However, if I do, for example, the following, all seems fine.

data(retailers)
retailers$key <- sprintf("RET~%02d",1:60)
head(retailers,3)
  size incl.prob staff turnover other.rev total.rev staff.costs total.costs
1  sc0      0.02    75       NA        NA      1130          NA       18915
2  sc3      0.14     9     1607        NA      1607         131        1544
3  sc3      0.14    NA     6886       -33      6919         324        6493
  profit vat    key
1  20045  NA RET~01
2     63  NA RET~02
3    426  NA RET~03
rules <- validator(turnover >=0)
cf <- confront(retailers, rules, key="key")
head(as.data.frame(cf), 3)
     key name value               expression
1 RET~01   V1    NA (turnover - 0) >= -1e-08
2 RET~02   V1  TRUE (turnover - 0) >= -1e-08
3 RET~03   V1  TRUE (turnover - 0) >= -1e-08

shabsgithub commented 5 years ago

Thanks, Mark. You are right. I am not facing this issue with all the fields in the data, only with this one field, "ABC". I suspect that the values in this field are what is causing the problem. ABC is a code field, so its values include things like "A", "B", NA, and null. So maybe what I need to understand is: does the validate package (when confronting with keys) produce accurate results when the data contains NA?

markvanderloo commented 5 years ago

I think you'd have to look at what paste does when one of the pasted vectors contains NA.
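
A minimal sketch of what paste does in that situation, using base R only. The data frame below is made up for illustration; only the column names ABC and XYZ and the "~" separator come from the snippet earlier in this thread. It shows that paste turns NA into the literal string "NA", so the resulting key is never missing, and that rows sharing the same ABC and XYZ values end up with the same key. Whether this is what causes the missing "ABC" results in the original data is an assumption that would need to be checked against the real data.

```r
# Hypothetical stand-in for the poster's data; values are invented.
dat <- data.frame(
  ABC = c("A", "B", NA, NA),
  XYZ = c("x1", "x2", "x3", "x3"),
  stringsAsFactors = FALSE
)

# paste() coerces NA to the string "NA" instead of propagating the missing value.
dat$key <- paste(dat$ABC, dat$XYZ, sep = "~")
print(dat$key)

# The key itself therefore contains no NA, even where ABC is missing.
any_na_key <- anyNA(dat$key)
print(any_na_key)

# Rows with identical ABC/XYZ values share one key, so the key is not unique.
n_dup <- sum(duplicated(dat$key))
print(n_dup)
```

If the key is meant to identify individual records, a quick check such as anyDuplicated(dat$key) before calling confront would reveal whether it actually does.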