DavZim / dataverifyr

A Lightweight, Flexible, and Fast Data Validation Package that Can Handle All Sizes of Data
https://davzim.github.io/dataverifyr/

filter_fails() fails #4

Closed oguchihy closed 1 year ago

oguchihy commented 1 year ago
filter_fails(res, df)
#> Error in parse(text = expr, keep.source = TRUE) : 
#>   <text>:2:0: unexpected end of input
#> 1: !(str_detect(str_to_lower(Pat.Type), str_to_lower("medical")) & 
#>   ^

Here is the code snippet:

rules <- ruleset(
    rule(Pat.Type !="Billing Only"),
    rule(!is.na(Encounter)),
    rule(!is.na(Pat.Type)),
    rule(!(duplicated(Encounter) | duplicated(Encounter, fromLast = TRUE))),
    rule(!(str_detect(str_to_lower(Pat.Name), str_to_lower("test")) | str_detect(str_to_lower(Pat.Name), paste0("^", str_to_lower("zz"))))),
    rule(...)
)

Other functions (plot_res(), for example) work just fine in the same script, in spite of the filter_fails() error.

DavZim commented 1 year ago

Can you also post a small dataset that reproduces the filter_fails() error?

oguchihy commented 1 year ago

I am unable to send the data, but I can say that the datasets used in the package examples work fine. Here is my plot output:

[Image: Rplot]

oguchihy commented 1 year ago

Also, while at it, could you consider adding the option to customize the name of the rule, to go after "Rule for:"? Thanks.

DavZim commented 1 year ago

You can set the name in the rule itself, e.g. rule(!is.na(Encounter), name = "no NA in Encounter") (see also the docs). The recommended approach, however, is to construct the rules in a separate YAML file (see the later part of the Example section of the README), where you write out the name of each rule explicitly. This is a lot handier when you have lots of rules.
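A minimal sketch of inline naming, assuming the built-in mtcars dataset and made-up rule names; only the name argument shown above is used:

```r
library(dataverifyr)

# Each rule gets an explicit, human-readable name (names are illustrative)
rs <- ruleset(
  rule(!is.na(mpg), name = "no NA in mpg"),
  rule(cyl %in% c(4, 6, 8), name = "cyl is 4, 6 or 8")
)

res <- check_data(mtcars, rs)

# The custom names appear in the name column of the result
res$name
```

The name column then carries these labels instead of the default "Rule for: ..." text.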

Now to your issue. There are still two things I need before I can troubleshoot it. First, the rule that throws the error is not shown (it starts with !(str_detect(str_to_lower(Pat.Type), str_to_lower("medical")) &, as the error message mentions). Can you show me the full rule? Second, I don't have any data to test against. Feel free to create an example dataset like this:

# adjust this so that you can still reproduce the error
df <- data.frame(
  Pat.Type = c("A", "B", "C"),
  medical = c("a", "b", "c")
)

oguchihy commented 1 year ago

Thanks; YAML works well, and I will update the names. To the main point: 1) I am unable to send the df because it contains personally identifiable data; however, here's the structure:

str_no_values(rules.dt)
'data.frame': 19367 obs. of 9 variables:
 $ Loc.Name     : chr "character" "character" "character" "character" ...
 $ Enc.Rendering: chr "character" "character" "character" "character" ...
 $ Encounter    : chr "integer" "integer" "integer" "integer" ...
 $ Fst.Consult  : chr "character" "character" "character" "character" ...
 $ Pat.Name     : chr "character" "character" "character" "character" ...
 $ Per.Nbr      : chr "integer" "integer" "integer" "integer" ...
 $ Dt.of.Svc    : chr "Date" "Date" "Date" "Date" ...
 $ Pat.Type     : chr "character" "character" "character" "character" ...
 $ Count        : chr "character" "character" "character" "character" ...

rules in yaml.txt

  1. Including yaml file so you can see all of the rules.
  2. Still, the error:

    filter_fails(res, rules.dt)
    Error in parse(text = expr, keep.source = TRUE) :
      <text>:2:0: unexpected end of input
    1: !(str_detect(str_to_lower(Pat.Name), str_to_lower("test")) |
       ^

    Here's the full line:

    rule(!(str_detect(str_to_lower(Pat.Name), str_to_lower("test")) | str_detect(str_to_lower(Pat.Name), paste0("^", str_to_lower("zz")))))

If I comment out that line, the same error rolls over to the expression on the next line. Sorry, that's the best I can do: hope it helps! Thanks again.

DavZim commented 1 year ago

I think I found the issue. The latest version on CRAN does not work with multiline rules. For example, the following would break with the error you found:

r <- rule(mpg > 10 &
            cyl %in% c(4, 6, 8) |
            disp > 10)
#> Warning message:
#> In substr(expr, 1, 1) == "\"" && substr(expr, nchar(expr), nchar(expr)) ==  :
#>   'length(x) = 2 > 1' in coercion to 'logical(1)'
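The same kind of parse error can be reproduced in base R alone. The sketch below rests on an assumption about the mechanism (it is a guess, not taken from the package source): a rule's expression is converted to text with deparse(), which wraps long expressions across several strings, and parsing only the first string then fails with "unexpected end of input".

```r
# A long expression, captured unevaluated (none of the functions are called)
expr <- quote(
  !(str_detect(str_to_lower(Pat.Name), str_to_lower("test")) |
      str_detect(str_to_lower(Pat.Name), paste0("^", str_to_lower("zz"))))
)

# deparse() wraps long expressions, so it returns more than one string here
parts <- deparse(expr)
length(parts)

# Parsing only the first piece fails, because the expression is cut off
first_only <- tryCatch(parse(text = parts[1]), error = function(e) e)
inherits(first_only, "error")

# Collapsing all pieces before parsing restores the complete expression
fixed <- parse(text = paste(parts, collapse = "\n"))[[1]]
identical(deparse(fixed), parts)
```

The warning above ('length(x) = 2 > 1') is consistent with this picture: a length-two character vector is being used where a single string is expected.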

The current version of the package should fix that. For example:

library(dataverifyr)
library(stringr)

df <- data.frame(
  Pat.Type = c("medical stuff", "other", "something MEDICAL"),
  Loc.Name = c("dental procedure", "ERROR", "DENTAL fixture")
)

rs <- ruleset(
  # all Pat.Type are NOT medical and all Loc.Name are NOT dental...
  rule(str_detect(str_to_lower(Pat.Type), str_to_lower("medical")) & 
         str_detect(str_to_lower(Loc.Name), str_to_lower("dental")),
       negate = TRUE)
)

res <- check_data(df, rs)
filter_fails(res, df)
#>             Pat.Type         Loc.Name
#> 1:     medical stuff dental procedure
#> 2: something MEDICAL   DENTAL fixture

Created on 2023-07-18 by the reprex package (v2.0.1)

Can you try to upgrade your package with devtools::install_github("DavZim/dataverifyr") and check if it works? I am happy to send an update to CRAN once we can confirm that this was the cause.

oguchihy commented 1 year ago

Works great now! Thanks.

oguchihy commented 1 year ago

This is only a suggestion, so I didn't want to open a new issue, although, based on the "rules" (pardon the pun), you might want me to elevate it to that status. filter_fails(., ..., per_rule = TRUE) works well. Could there also be a way to filter the fails of selected rules only, for instance when one wants to deal with the results of particular rules and not all of them? Thanks again for a handy package!

DavZim commented 1 year ago

What exactly do you have in mind? Can you maybe write some high-level pseudocode of what you want to achieve with this? I am always interested in expanding the functionality, so I'm all ears!

oguchihy commented 1 year ago

I have this check_data() result (I have selected desired columns):

                                                                   name tests pass fail
 1:                                    "Billing Only" in Pat.Type Field  9845 9843    2
 2:                                               NA in Encounter Field  9845 9845    0
 3:                                                NA in Pat.Type Field  9845 9845    0
 4:                                       Duplicates in Encounter Field  9845 9841    4
 5:                                    Test or zztest in Pat.Name Field  9845 9845    0
 6:                   Mismatched Medical visit & Dental Clinic location  9845 9845    0
 7:                   Mismatched Dental visit & Medical Clinic location  9845 9845    0
 8:                                         "Labs" in Fst.Consult Field  9845 9845    0
 9:                                          Blank in Fst.Consult Field  9845  624 9221
10:                                             NA in Fst.Consult Field  9845  624 9221
11:               Exclusive (only correct) Visit-type in Pat.Type Field  9845 9321  524
12:        Exclusive (only correct) CPSP Providers in Fst.Consult Field  9845 9693  152
13:                          Behavioral Health matches service location  9845 9844    1
14: Match CPSP provider FM in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0
15: Match CPSP provider GM in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0
16: Match CPSP provider SJ in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0
17: Match CPSP provider NS in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0
18: Match CPSP provider GB in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0
19: Match CPSP provider RM in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0

I want to restrict filter_fails() to selected rows of this result, say, 1:5 and 13.

DavZim commented 1 year ago

Since the result of check_data() is a standard data.frame, you can subset or filter it like any other data.frame before passing it to filter_fails().

For example:

library(dataverifyr)

rs <- ruleset(
  rule(mpg > 10),
  rule(cyl %in% c(4, 6)), # missing 8
  rule(qsec >= 14.5 & qsec <= 20.9)
)

res <- check_data(mtcars, rs)

filter_fails(res, mtcars, per_rule = TRUE)
#> $`cyl %in% c(4, 6)`
#>      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#>  1: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#>  2: 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#>  3: 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#>  4: 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#>  5: 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#>  6: 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#>  7: 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#>  8: 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#>  9: 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> 10: 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> 11: 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> 12: 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> 13: 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> 14: 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> 
#> $`qsec >= 14.5 & qsec <= 20.9`
#>     mpg cyl  disp hp drat   wt qsec vs am gear carb
#> 1: 22.8   4 140.8 95 3.92 3.15 22.9  1  0    4    2

# take only the first and third rule
res |>
  dplyr::slice(c(1, 3)) |>
  filter_fails(mtcars, per_rule = TRUE)
#> $`qsec >= 14.5 & qsec <= 20.9`
#>     mpg cyl  disp hp drat   wt qsec vs am gear carb
#> 1: 22.8   4 140.8 95 3.92 3.15 22.9  1  0    4    2

# filter only rules which contain a '>' sign
res |>
  dplyr::filter(stringr::str_detect(expr, ">")) |>
  filter_fails(mtcars, per_rule = TRUE)
#> $`qsec >= 14.5 & qsec <= 20.9`
#>     mpg cyl  disp hp drat   wt qsec vs am gear carb
#> 1: 22.8   4 140.8 95 3.92 3.15 22.9  1  0    4    2

Created on 2023-07-19 by the reprex package (v2.0.1)

oguchihy commented 1 year ago

Got it, thanks. I wasn't passing the original df (mtcars here) to filter_fails() after slicing.

oguchihy commented 1 year ago

Not sure why I all of a sudden started getting this error:

preprocess.err <- preprocess1 %>%
  dplyr::slice(c(1:8, 13)) %>%
  filter_fails(rules.dt, per_rule = TRUE)
#> Error in parse(text = e) : <text>:1:3: unexpected ')'
#> 1: !()
#>    ^

Here’s my code:

daily_visit_rules.preprocess <- read_rules("daily_visit_rules1.yaml")
rules.dt <- readRDS("Original Daily Visits from Server Report.RDS")
preprocess1 <- check_data(rules.dt, daily_visit_rules.preprocess) 
preprocess <- preprocess1 %>% select(name, tests, pass, fail)
preprocess

preprocess.dt <- filter_fails(daily_visit_rules.preprocess, rules.dt, per_rule = TRUE)
preprocess.err <- preprocess1 %>%
  dplyr::slice(c(1:8, 13)) %>%
  filter_fails(rules.dt, per_rule = TRUE)

DavZim commented 1 year ago

Seems like there is an error in one of the rules...
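One way to narrow down which rule is affected, sketched in base R: try to parse each expression string on its own and report the positions that fail. The helper name find_unparsable() and the example strings are hypothetical, chosen only to illustrate the idea.

```r
# Hypothetical helper: return the positions of strings that fail to parse
find_unparsable <- function(exprs) {
  ok <- vapply(exprs, function(e) {
    !inherits(try(parse(text = e), silent = TRUE), "try-error")
  }, logical(1), USE.NAMES = FALSE)
  which(!ok)
}

# Illustrative input: the third string is the broken "!()" from the error
exprs <- c(
  'Pat.Type != "Billing Only"',
  '!is.na(Encounter)',
  '!()',
  '!(duplicated(Encounter) | duplicated(Encounter, fromLast = TRUE))'
)

bad <- find_unparsable(exprs)
bad
```

With a real check_data() result, the expr column could be passed in the same way, e.g. find_unparsable(preprocess1$expr), assuming that column holds each rule as a single string.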

oguchihy commented 1 year ago

Any help identifying which one? They are all posted above. I double-checked the syntax of each rule, even with ChatGPT, and it came out clean. Any help is appreciated! Thanks again.

oguchihy commented 1 year ago

Also, all results are fine, except for this particular function. For example, after I correct errors and run a post-correction check, I get expected outcome.

oguchihy commented 1 year ago

In all the different variations I have tried,

preprocess.err <- preprocess1 %>%
  dplyr::slice(c(1:8, 13))

works fine, including using the expression:

preprocess.err <- preprocess1 %>%
  filter(fail < 10)

The only part that is consistently throwing up the same error is the last line:

 ...%>%
  filter_fails(rules.dt, per_rule = TRUE)

This happens with or without per_rule.

DavZim commented 1 year ago

I think I found and fixed the culprit: you had rules spanning multiple lines. I fixed this in the latest development version. Can you try again?

library(dataverifyr)
library(stringr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

url <- "https://github.com/DavZim/dataverifyr/files/12070624/rules.in.yaml.txt"
file <- tempfile(fileext = ".yaml")
download.file(url, file)
rules <- read_rules(file)
rules
#> <Verification Ruleset with 14 elements>
#>   [1] 'Rule for: Pat.Type' matching `Pat.Type != "Billing Only"` (allow_na: FALSE)
#>   [2] 'Rule for: Encounter' matching `!is.na(Encounter)` (allow_na: FALSE)
#>   [3] 'Rule for: Pat.Type' matching `!is.na(Pat.Type)` (allow_na: FALSE)
#>   ... +11 more. Use print(ruleset, n = 10) to print more.

# example data
rules.dt <- data.frame(
  Pat.Type = "A",
  Encounter = "B",
  Pat.Name = "C",
  Loc.Name = "D",
  Fst.Consult = "E"
)

preprocess1 <- check_data(rules.dt, rules)
preprocess1
#>                                name
#>  1:              Rule for: Pat.Type
#>  2:             Rule for: Encounter
#>  3:              Rule for: Pat.Type
#>  4:             Rule for: Encounter
#>  5:              Rule for: Pat.Name
#>  6:    Rule for: Pat.Type, Loc.Name
#>  7:    Rule for: Loc.Name, Pat.Type
#>  8:           Rule for: Fst.Consult
#>  9:           Rule for: Fst.Consult
#> 10:              Rule for: Pat.Type
#> 11: Rule for: Fst.Consult, Loc.Name
#> 12: Rule for: Fst.Consult, Loc.Name
#> 13: Rule for: Fst.Consult, Loc.Name
#> 14: Rule for: Fst.Consult, Loc.Name
#>                                                                                                                                                                                                          expr
#>  1:                                                                                                                                                                                Pat.Type != "Billing Only"
#>  2:                                                                                                                                                                                         !is.na(Encounter)
#>  3:                                                                                                                                                                                          !is.na(Pat.Type)
#>  4:                                                                                                                                         !(duplicated(Encounter) | duplicated(Encounter, fromLast = TRUE))
#>  5:                                                                   !(str_detect(str_to_lower(Pat.Name), str_to_lower("test")) | \n    str_detect(str_to_lower(Pat.Name), paste0("^", str_to_lower("zz"))))
#>  6:                                                                         !(str_detect(str_to_lower(Pat.Type), str_to_lower("medical")) & \n    str_detect(str_to_lower(Loc.Name), str_to_lower("dental")))
#>  7:                                                                         !(str_detect(str_to_lower(Loc.Name), str_to_lower("medical")) & \n    str_detect(str_to_lower(Pat.Type), str_to_lower("dental")))
#>  8:                                                                                                                                    !str_detect(Fst.Consult, "Labs") | str_detect(Fst.Consult, "^\\\\s*$")
#>  9:                                                                                                                                                                                       !is.na(Fst.Consult)
#> 10:                                                                                          Pat.Type %in% c("CPSP Visits", "Dental Visit", "Medical Visit", \n    "Optometry Visit", "Telephone - Billable")
#> 11:                                                                                        nzchar(trimws(Fst.Consult)) | (Fst.Consult == "Fernandez, Maria" & \n    Loc.Name == "Castroville Medical Clinic")
#> 12: nzchar(trimws(Fst.Consult)) | (Fst.Consult == "Gaytan, Melvis" & \n    (Loc.Name == "North Main Medical Clinic" | Loc.Name == "Soledad Medical Clinic" | \n        Loc.Name == "Sanborn Medical Clinic"))
#> 13:                nzchar(trimws(Fst.Consult)) | (Fst.Consult == "Silva, Judy" & \n    Loc.Name == "Greenfield Medical Clinic" | Fst.Consult == \n    "Silva, Judy" & Loc.Name == "King City Medical Clinic")
#> 14:                                                                                            nzchar(trimws(Fst.Consult)) | (Fst.Consult == "Nunez, Saturnino" & \n    Loc.Name == "Sanborn Medical Clinic")
#>     allow_na negate tests pass fail warn error              time
#>  1:    FALSE  FALSE     1    1    0            0.0042879581 secs
#>  2:    FALSE  FALSE     1    1    0            0.0003609657 secs
#>  3:    FALSE  FALSE     1    1    0            0.0003187656 secs
#>  4:    FALSE  FALSE     1    1    0            0.0003311634 secs
#>  5:    FALSE  FALSE     1    1    0            0.0010800362 secs
#>  6:    FALSE  FALSE     1    1    0            0.0004789829 secs
#>  7:    FALSE  FALSE     1    1    0            0.0003829002 secs
#>  8:     TRUE  FALSE     1    1    0            0.0003590584 secs
#>  9:    FALSE  FALSE     1    1    0            0.0002939701 secs
#> 10:    FALSE  FALSE     1    0    1            0.0024039745 secs
#> 11:    FALSE  FALSE     1    1    0            0.0005309582 secs
#> 12:    FALSE  FALSE     1    1    0            0.0005221367 secs
#> 13:    FALSE  FALSE     1    1    0            0.0004880428 secs
#> 14:    FALSE  FALSE     1    1    0            0.0004830360 secs

preprocess1 |> 
  slice(c(1:8, 13))
#>                               name
#> 1:              Rule for: Pat.Type
#> 2:             Rule for: Encounter
#> 3:              Rule for: Pat.Type
#> 4:             Rule for: Encounter
#> 5:              Rule for: Pat.Name
#> 6:    Rule for: Pat.Type, Loc.Name
#> 7:    Rule for: Loc.Name, Pat.Type
#> 8:           Rule for: Fst.Consult
#> 9: Rule for: Fst.Consult, Loc.Name
#>                                                                                                                                                                                          expr
#> 1:                                                                                                                                                                 Pat.Type != "Billing Only"
#> 2:                                                                                                                                                                          !is.na(Encounter)
#> 3:                                                                                                                                                                           !is.na(Pat.Type)
#> 4:                                                                                                                          !(duplicated(Encounter) | duplicated(Encounter, fromLast = TRUE))
#> 5:                                                    !(str_detect(str_to_lower(Pat.Name), str_to_lower("test")) | \n    str_detect(str_to_lower(Pat.Name), paste0("^", str_to_lower("zz"))))
#> 6:                                                          !(str_detect(str_to_lower(Pat.Type), str_to_lower("medical")) & \n    str_detect(str_to_lower(Loc.Name), str_to_lower("dental")))
#> 7:                                                          !(str_detect(str_to_lower(Loc.Name), str_to_lower("medical")) & \n    str_detect(str_to_lower(Pat.Type), str_to_lower("dental")))
#> 8:                                                                                                                     !str_detect(Fst.Consult, "Labs") | str_detect(Fst.Consult, "^\\\\s*$")
#> 9: nzchar(trimws(Fst.Consult)) | (Fst.Consult == "Silva, Judy" & \n    Loc.Name == "Greenfield Medical Clinic" | Fst.Consult == \n    "Silva, Judy" & Loc.Name == "King City Medical Clinic")
#>    allow_na negate tests pass fail warn error              time
#> 1:    FALSE  FALSE     1    1    0            0.0042879581 secs
#> 2:    FALSE  FALSE     1    1    0            0.0003609657 secs
#> 3:    FALSE  FALSE     1    1    0            0.0003187656 secs
#> 4:    FALSE  FALSE     1    1    0            0.0003311634 secs
#> 5:    FALSE  FALSE     1    1    0            0.0010800362 secs
#> 6:    FALSE  FALSE     1    1    0            0.0004789829 secs
#> 7:    FALSE  FALSE     1    1    0            0.0003829002 secs
#> 8:     TRUE  FALSE     1    1    0            0.0003590584 secs
#> 9:    FALSE  FALSE     1    1    0            0.0004880428 secs

preprocess1 |> 
  slice(c(1:8, 13)) |> 
  filter_fails(x = rules.dt, per_rule = TRUE)
#> [1] Pat.Type    Encounter   Pat.Name    Loc.Name    Fst.Consult
#> <0 rows> (or 0-length row.names)

Created on 2023-07-21 by the reprex package (v2.0.1)