Closed oguchihy closed 1 year ago
Can you also post a small dataset on which you use the filter for the fails?
I am unable to send the data, but I can say that the datasets used as examples in the package work fine. Here is my plot output:
Also, while at it, could you consider adding the option to customize the name of the rule, to go after "Rule for:"? Thanks.
You can set the name in the rule itself, e.g. rule(!is.na(Encounter), name = "no NA in Encounter") (see also the doc).
See also the recommended approach of constructing rules in a separate yaml file (see also the later part of the Example Section of the Readme), where you explicitly write out the names of the rules. This is a lot handier when you have lots of rules.
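For illustration, a minimal sketch of the yaml workflow. It assumes the package's write_rules() and read_rules() functions; the rule names, the use of mtcars, and the temporary file path are made up for the example.

```r
library(dataverifyr)

# two named rules; the names are illustrative only
rs <- ruleset(
  rule(!is.na(mpg), name = "no NA in mpg"),
  rule(cyl %in% c(4, 6, 8), name = "cyl is 4, 6, or 8")
)

# write the rules to a yaml file and read them back
rules_file <- tempfile(fileext = ".yaml")
write_rules(rs, rules_file)
rs_from_yaml <- read_rules(rules_file)

# the names set above show up in the name column of the result
res <- check_data(mtcars, rs_from_yaml)
res$name
```

Once the file exists, the names can be edited directly in the yaml file, which is easier to maintain than long rule() calls when there are many rules.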
Now to your issue. There are still two things I need before I can troubleshoot your main issue. First, the rule that throws the error is not shown in full; the error message only shows that it starts with !(str_detect(str_to_lower(Pat.Type), str_to_lower("medical")) &. Can you show me the full rule?
Second, I don't have any data to test against. Feel free to create an example dataset like this.
# adjust this so that you can still reproduce the error
df <- data.frame(
Pat.Type = c("A", "B", "C"),
medical = c("a", "b", "c")
)
Thanks; yaml works well; will update names. To the main point: 1) I am unable to send the df because it contains personally identifiable data; however, here's the structure:
str_no_values(rules.dt)
'data.frame': 19367 obs. of 9 variables:
 $ Loc.Name     : chr "character" "character" "character" "character" ...
 $ Enc.Rendering: chr "character" "character" "character" "character" ...
 $ Encounter    : chr "integer" "integer" "integer" "integer" ...
 $ Fst.Consult  : chr "character" "character" "character" "character" ...
 $ Pat.Name     : chr "character" "character" "character" "character" ...
 $ Per.Nbr      : chr "integer" "integer" "integer" "integer" ...
 $ Dt.of.Svc    : chr "Date" "Date" "Date" "Date" ...
 $ Pat.Type     : chr "character" "character" "character" "character" ...
 $ Count        : chr "character" "character" "character" "character" ...
rules in yaml.txt
filter_fails(res, rules.dt)
Error in parse(text = expr, keep.source = TRUE) :
  <text>:2:0: unexpected end of input
1: !(str_detect(str_to_lower(Pat.Name), str_to_lower("test")) |
   ^
Here's the full line:
rule(!(str_detect(str_to_lower(Pat.Name), str_to_lower("test")) | str_detect(str_to_lower(Pat.Name), paste0("^", str_to_lower("zz")))))
If I comment out that line, the same error rolls over to the expression on the next line. Sorry, that's the best I can do: hope it helps! Thanks again.
I think I found the issue. The latest version on CRAN does not work with multiline rules. E.g., the following would break with the error you found:
r <- rule(mpg > 10 &
cyl %in% c(4, 6, 8) |
disp > 10)
#> Warning message:
#> In substr(expr, 1, 1) == "\"" && substr(expr, nchar(expr), nchar(expr)) == :
#> 'length(x) = 2 > 1' in coercion to 'logical(1)'
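A base-R sketch of what probably goes wrong with multiline rules. It assumes the rule expression is turned into text with deparse(); the package internals are not shown in this thread, so treat that as an assumption. A long expression deparses into several strings, parsing only the first one gives an "unexpected end of input" error, and collapsing the pieces first parses fine.

```r
# a long expression, standing in for a multiline rule
e <- quote(mpg > 10 & cyl %in% c(4, 6, 8) | disp > 10)

# a small width.cutoff forces deparse() to split the text into several strings
parts <- deparse(e, width.cutoff = 20)
length(parts) > 1

# parsing only the first piece fails because the expression is incomplete;
# the error message is captured as a string instead of stopping the script
first_only <- tryCatch(
  parse(text = parts[1], keep.source = TRUE),
  error = function(err) conditionMessage(err)
)

# collapsing the pieces into one string parses without error
collapsed <- paste(parts, collapse = " ")
parsed <- parse(text = collapsed, keep.source = TRUE)
```

The length-2 vector is also consistent with the 'length(x) = 2 > 1' warning above, since && expects a single logical value.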
The current development version of the package (on GitHub) should fix that, e.g.:
library(dataverifyr)
library(stringr)
df <- data.frame(
Pat.Type = c("medical stuff", "other", "something MEDICAL"),
Loc.Name = c("dental procedure", "ERROR", "DENTAL fixture")
)
rs <- ruleset(
# no row may combine a medical Pat.Type with a dental Loc.Name...
rule(str_detect(str_to_lower(Pat.Type), str_to_lower("medical")) &
str_detect(str_to_lower(Loc.Name), str_to_lower("dental")),
negate = TRUE)
)
res <- check_data(df, rs)
filter_fails(res, df)
#> Pat.Type Loc.Name
#> 1: medical stuff dental procedure
#> 2: something MEDICAL DENTAL fixture
Created on 2023-07-18 by the reprex package (v2.0.1)
Can you try to upgrade your package with devtools::install_github("DavZim/dataverifyr") and check if it works? I am happy to send an update to CRAN once we can confirm that this was the cause.
Works great now! Thanks.
This is only a suggestion, so I didn't want to open a new issue, although, based on "rules" (pardon the pun), you might want me to elevate it to that status. filter_fails(., ..., per_rule = TRUE) works well. Could it also be used to filter the fails of selected rules only, for instance when one wants to deal with the results of particular rules rather than all of them? Thanks again for a handy package!
What exactly do you have in mind? Can you maybe write some high-level pseudocode of what you want to achieve with this? I am always interested in expanding the functionality, so I'm all ears!
I have this check_data() result (I have selected desired columns):
    name                                                                tests pass fail
 1: "Billing Only" in Pat.Type Field                                     9845 9843    2
 2: NA in Encounter Field                                                9845 9845    0
 3: NA in Pat.Type Field                                                 9845 9845    0
 4: Duplicates in Encounter Field                                        9845 9841    4
 5: Test or zztest in Pat.Name Field                                     9845 9845    0
 6: Mismatched Medical visit & Dental Clinic location                    9845 9845    0
 7: Mismatched Dental visit & Medical Clinic location                    9845 9845    0
 8: "Labs" in Fst.Consult Field                                          9845 9845    0
 9: Blank in Fst.Consult Field                                           9845  624 9221
10: NA in Fst.Consult Field                                              9845  624 9221
11: Exclusive (only correct) Visit-type in Pat.Type Field                9845 9321  524
12: Exclusive (only correct) CPSP Providers in Fst.Consult Field         9845 9693  152
13: Behavioral Health matches service location                           9845 9844    1
14: Match CPSP provider FM in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0
15: Match CPSP provider GM in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0
16: Match CPSP provider SJ in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0
17: Match CPSP provider NS in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0
18: Match CPSP provider GB in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0
19: Match CPSP provider RM in Fst.Consult to correct clinics (Loc.Name)  9845 9845    0
I want to filter filter_fails() by selected row numbers, say, 1:5,13
As the result of check_data() is a standard data.frame, you can do anything with it that you can do with any other data.frame. For example:
library(dataverifyr)
rs <- ruleset(
rule(mpg > 10),
rule(cyl %in% c(4, 6)), # missing 8
rule(qsec >= 14.5 & qsec <= 20.9)
)
res <- check_data(mtcars, rs)
filter_fails(res, mtcars, per_rule = TRUE)
#> $`cyl %in% c(4, 6)`
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 2: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> 3: 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> 4: 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> 5: 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> 6: 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 7: 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> 8: 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 9: 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> 10: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> 11: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> 12: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> 13: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 14: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#>
#> $`qsec >= 14.5 & qsec <= 20.9`
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1: 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
# take only the first and third rule
res |>
dplyr::slice(c(1, 3)) |>
filter_fails(mtcars, per_rule = TRUE)
#> $`qsec >= 14.5 & qsec <= 20.9`
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1: 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
# filter only rules which contain a '>' sign
res |>
dplyr::filter(stringr::str_detect(expr, ">")) |>
filter_fails(mtcars, per_rule = TRUE)
#> $`qsec >= 14.5 & qsec <= 20.9`
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1: 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
Created on 2023-07-19 by the reprex package (v2.0.1)
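Since the request was about picking particular rules, here is a small sketch that selects rules by name instead of by row position, which may hold up better if the rules get reordered. It assumes a package version in which filtered results work with filter_fails(); the rule names are made up for illustration.

```r
library(dataverifyr)

rs <- ruleset(
  rule(mpg > 10, name = "mpg above 10"),
  rule(cyl %in% c(4, 6), name = "cyl is 4 or 6"),  # missing 8
  rule(qsec >= 14.5 & qsec <= 20.9, name = "qsec in range")
)
res <- check_data(mtcars, rs)

# keep only the rules of interest, selected by name
wanted <- c("cyl is 4 or 6")
res_sel <- res[res$name %in% wanted, ]

# pass the original data (mtcars) as the second argument
fails <- filter_fails(res_sel, mtcars)
nrow(fails)
```

The subsetting uses base R only, so it does not depend on dplyr being loaded.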
Got it, thanks. After slicing, I wasn't passing the original df (mtcars here) to filter_fails().
Not sure why I all of a sudden started getting this error:
preprocess.err <- preprocess1 %>%
dplyr::slice(c(1:8, 13)) %>%
filter_fails(rules.dt, per_rule = TRUE)
#> Error in parse(text = e) : <text>:1:3: unexpected ')'
#> 1: !()
#> ^
Here’s my code:
daily_visit_rules.preprocess <- read_rules("daily_visit_rules1.yaml")
rules.dt <- readRDS("Original Daily Visits from Server Report.RDS")
preprocess1 <- check_data(rules.dt, daily_visit_rules.preprocess)
preprocess <- preprocess1 %>% select(name, tests, pass, fail)
preprocess
preprocess.dt <- filter_fails(daily_visit_rules.preprocess, rules.dt, per_rule = TRUE)
preprocess.err <- preprocess1 %>%
dplyr::slice(c(1:8, 13)) %>%
filter_fails(rules.dt, per_rule = TRUE)
It seems like there is an error in one of the rules...
Any help identifying which one? They are all posted above. I double-checked the syntax of each rule, even with ChatGPT, and they came out clean. Any help is appreciated! Thanks again.
Also, all results are fine, except for this particular function. For example, after I correct errors and run a post-correction check, I get expected outcome.
In all the different variations I have tried,
preprocess.err <- preprocess1 %>%
dplyr::slice(c(1:8, 13))
works fine, including using the expression:
preprocess.err <- preprocess1 %>%
filter(fail <10)
The only part that is consistently throwing up the same error is the last line:
...%>%
filter_fails(rules.dt, per_rule = TRUE)
with or without per_rule
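One way to narrow down which rule trips the parser is to try parsing each expression on its own. A base-R sketch, assuming the expressions are available as a character vector (for example the expr column of the check_data() result); the helper name find_unparsable() and the sample expressions are hypothetical.

```r
# returns the positions of the expressions that parse() rejects
find_unparsable <- function(exprs) {
  ok <- vapply(exprs, function(e) {
    tryCatch({
      parse(text = e)
      TRUE
    }, error = function(err) FALSE)
  }, logical(1), USE.NAMES = FALSE)
  which(!ok)
}

exprs <- c(
  "!is.na(Encounter)",
  "!()",                          # the broken text from the error message
  "Pat.Type != \"Billing Only\""
)
find_unparsable(exprs)
```

If every expression parses on its own, the rules themselves are probably fine and the text is more likely being mangled somewhere between check_data() and filter_fails().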
I think I found and fixed the culprit: you had rules that span multiple lines. I fixed this in the latest development version. Can you try again?
library(dataverifyr)
library(stringr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
url <- "https://github.com/DavZim/dataverifyr/files/12070624/rules.in.yaml.txt"
file <- tempfile(fileext = ".yaml")
download.file(url, file)
rules <- read_rules(file)
rules
#> <Verification Ruleset with 14 elements>
#> [1] 'Rule for: Pat.Type' matching `Pat.Type != "Billing Only"` (allow_na: FALSE)
#> [2] 'Rule for: Encounter' matching `!is.na(Encounter)` (allow_na: FALSE)
#> [3] 'Rule for: Pat.Type' matching `!is.na(Pat.Type)` (allow_na: FALSE)
#> ... +11 more. Use print(ruleset, n = 10) to print more.
# example data
rules.dt <- data.frame(
Pat.Type = "A",
Encounter = "B",
Pat.Name = "C",
Loc.Name = "D",
Fst.Consult = "E"
)
preprocess1 <- check_data(rules.dt, rules)
preprocess1
#> name
#> 1: Rule for: Pat.Type
#> 2: Rule for: Encounter
#> 3: Rule for: Pat.Type
#> 4: Rule for: Encounter
#> 5: Rule for: Pat.Name
#> 6: Rule for: Pat.Type, Loc.Name
#> 7: Rule for: Loc.Name, Pat.Type
#> 8: Rule for: Fst.Consult
#> 9: Rule for: Fst.Consult
#> 10: Rule for: Pat.Type
#> 11: Rule for: Fst.Consult, Loc.Name
#> 12: Rule for: Fst.Consult, Loc.Name
#> 13: Rule for: Fst.Consult, Loc.Name
#> 14: Rule for: Fst.Consult, Loc.Name
#> expr
#> 1: Pat.Type != "Billing Only"
#> 2: !is.na(Encounter)
#> 3: !is.na(Pat.Type)
#> 4: !(duplicated(Encounter) | duplicated(Encounter, fromLast = TRUE))
#> 5: !(str_detect(str_to_lower(Pat.Name), str_to_lower("test")) | \n str_detect(str_to_lower(Pat.Name), paste0("^", str_to_lower("zz"))))
#> 6: !(str_detect(str_to_lower(Pat.Type), str_to_lower("medical")) & \n str_detect(str_to_lower(Loc.Name), str_to_lower("dental")))
#> 7: !(str_detect(str_to_lower(Loc.Name), str_to_lower("medical")) & \n str_detect(str_to_lower(Pat.Type), str_to_lower("dental")))
#> 8: !str_detect(Fst.Consult, "Labs") | str_detect(Fst.Consult, "^\\\\s*$")
#> 9: !is.na(Fst.Consult)
#> 10: Pat.Type %in% c("CPSP Visits", "Dental Visit", "Medical Visit", \n "Optometry Visit", "Telephone - Billable")
#> 11: nzchar(trimws(Fst.Consult)) | (Fst.Consult == "Fernandez, Maria" & \n Loc.Name == "Castroville Medical Clinic")
#> 12: nzchar(trimws(Fst.Consult)) | (Fst.Consult == "Gaytan, Melvis" & \n (Loc.Name == "North Main Medical Clinic" | Loc.Name == "Soledad Medical Clinic" | \n Loc.Name == "Sanborn Medical Clinic"))
#> 13: nzchar(trimws(Fst.Consult)) | (Fst.Consult == "Silva, Judy" & \n Loc.Name == "Greenfield Medical Clinic" | Fst.Consult == \n "Silva, Judy" & Loc.Name == "King City Medical Clinic")
#> 14: nzchar(trimws(Fst.Consult)) | (Fst.Consult == "Nunez, Saturnino" & \n Loc.Name == "Sanborn Medical Clinic")
#> allow_na negate tests pass fail warn error time
#> 1: FALSE FALSE 1 1 0 0.0042879581 secs
#> 2: FALSE FALSE 1 1 0 0.0003609657 secs
#> 3: FALSE FALSE 1 1 0 0.0003187656 secs
#> 4: FALSE FALSE 1 1 0 0.0003311634 secs
#> 5: FALSE FALSE 1 1 0 0.0010800362 secs
#> 6: FALSE FALSE 1 1 0 0.0004789829 secs
#> 7: FALSE FALSE 1 1 0 0.0003829002 secs
#> 8: TRUE FALSE 1 1 0 0.0003590584 secs
#> 9: FALSE FALSE 1 1 0 0.0002939701 secs
#> 10: FALSE FALSE 1 0 1 0.0024039745 secs
#> 11: FALSE FALSE 1 1 0 0.0005309582 secs
#> 12: FALSE FALSE 1 1 0 0.0005221367 secs
#> 13: FALSE FALSE 1 1 0 0.0004880428 secs
#> 14: FALSE FALSE 1 1 0 0.0004830360 secs
preprocess1 |>
slice(c(1:8, 13))
#> name
#> 1: Rule for: Pat.Type
#> 2: Rule for: Encounter
#> 3: Rule for: Pat.Type
#> 4: Rule for: Encounter
#> 5: Rule for: Pat.Name
#> 6: Rule for: Pat.Type, Loc.Name
#> 7: Rule for: Loc.Name, Pat.Type
#> 8: Rule for: Fst.Consult
#> 9: Rule for: Fst.Consult, Loc.Name
#> expr
#> 1: Pat.Type != "Billing Only"
#> 2: !is.na(Encounter)
#> 3: !is.na(Pat.Type)
#> 4: !(duplicated(Encounter) | duplicated(Encounter, fromLast = TRUE))
#> 5: !(str_detect(str_to_lower(Pat.Name), str_to_lower("test")) | \n str_detect(str_to_lower(Pat.Name), paste0("^", str_to_lower("zz"))))
#> 6: !(str_detect(str_to_lower(Pat.Type), str_to_lower("medical")) & \n str_detect(str_to_lower(Loc.Name), str_to_lower("dental")))
#> 7: !(str_detect(str_to_lower(Loc.Name), str_to_lower("medical")) & \n str_detect(str_to_lower(Pat.Type), str_to_lower("dental")))
#> 8: !str_detect(Fst.Consult, "Labs") | str_detect(Fst.Consult, "^\\\\s*$")
#> 9: nzchar(trimws(Fst.Consult)) | (Fst.Consult == "Silva, Judy" & \n Loc.Name == "Greenfield Medical Clinic" | Fst.Consult == \n "Silva, Judy" & Loc.Name == "King City Medical Clinic")
#> allow_na negate tests pass fail warn error time
#> 1: FALSE FALSE 1 1 0 0.0042879581 secs
#> 2: FALSE FALSE 1 1 0 0.0003609657 secs
#> 3: FALSE FALSE 1 1 0 0.0003187656 secs
#> 4: FALSE FALSE 1 1 0 0.0003311634 secs
#> 5: FALSE FALSE 1 1 0 0.0010800362 secs
#> 6: FALSE FALSE 1 1 0 0.0004789829 secs
#> 7: FALSE FALSE 1 1 0 0.0003829002 secs
#> 8: TRUE FALSE 1 1 0 0.0003590584 secs
#> 9: FALSE FALSE 1 1 0 0.0004880428 secs
preprocess1 |>
slice(c(1:8, 13)) |>
filter_fails(x = rules.dt, per_rule = TRUE)
#> [1] Pat.Type Encounter Pat.Name Loc.Name Fst.Consult
#> <0 rows> (or 0-length row.names)
Created on 2023-07-21 by the reprex package (v2.0.1)
Here is the code snippet:
Other functions (plot_res(), for example) are working just fine off the same script, in spite of the filter_fails() error.