Closed beniaminogreen closed 11 months ago
Hi Ben,
thanks for the PR (and on a related note congrats to the sucess of zoomerjoin <3).
I like the idea of having an option to combine rulesets. I am at the moment unsure whether I would want to overload the +
operator or have a function like data.table::rbindlist()
(maybe bindlist(list(ruleset1, ruleset2, ruleset3))
). Maybe both would be possible.
Also, I would like to have a couple of tests that cover some use-cases. And last but not least, feel free to add yourself to the DESCRIPTION file.
Best, David
I very much like the suggestion to add a bind_rules
or something similar to allow users to programatically combine a list of rulsets and rules. I've gone ahead and implemented this. Under the hood, I've done this by improving the overloading of +
so that you can now also add:
rule
with a rule
rule
with a ruleset
ruleset
with a rule
ruleset
with a ruleset
Here's a snippet showing off this new behavior.
rule_1 <- rule(mpg > 10 & mpg < 30) # mpg goes up to 34
rule_2 <- rule(cyl %in% c(4, 8)) # missing 6 cyl
rule_3 <- rule(vs %in% c(0, 1), allow_na = TRUE)
# combine rules with rules
rule_1 + rule_2
#> <Verification Ruleset with 2 elements>
#> [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#> [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
# combine rulesets with rules
ruleset(rule_1, rule_2) + rule_3
#> <Verification Ruleset with 3 elements>
#> [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#> [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#> [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)
# combine rules with rulesets
rule_1 + ruleset(rule_2, rule_3)
#> <Verification Ruleset with 3 elements>
#> [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#> [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#> [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)
# rulesets with rulesets
ruleset(rule_1) + ruleset(rule_2, rule_3)
#> <Verification Ruleset with 3 elements>
#> [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#> [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#> [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)
# combine a list of rulsets and rules with `bind_rules`
bind_rules(list(rule_1, rule_2, ruleset(rule_3)))
#> <Verification Ruleset with 3 elements>
#> [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#> [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#> [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)
Created on 2023-10-16 with reprex v2.0.2
If you think this is a promising direction, I would love to spruce up the documentation, and can add some examples. I've also gone ahead and added some tests to verify that the +
method and bind_rules are working as expected.
I have added two comments to your code. I like where this is going, the only thing that I think is missing are duplications in rules. How do we want to handle them? Drop duplications or do we want to keep them? What do you think makes more sense?
Just made both of those changes - thanks for taking a look at them.
How to handle duplicates is a good question. I think that keeping or removing duplicates makes sense as long as the documentation is very explicit about it.
I have just made a change to remove duplicates when adding two sets together. This makes the behavior consistent with what a user might expect when taking the union of two sets - duplicates will automatically be removed, because sets don't contain duplicates.
Looks good to me. I'll run the tests and then merge the PR.
There is one error in the automated tests https://github.com/DavZim/dataverifyr/actions/runs/6547484442/job/17788344146#step:6:137
Thanks for flagging this. I think the github runner is set to be more strict than the one on my PC. Just fixed the warning and the note it raised.
Edit: I think the issue was I was using devtools::test() not devtools::check() because I did not have duckdb installed on the machine I am using.
Sorry, realized there was another issue. Just tested with devtools::check_man() and devtools::check() and it all looks good now.
Hi there, thanks for writing such a fantastic package!
I'm writing to ask if you can implement the
+
operator to combine several rulesets. This would allow users to combine several rulesets programatically and concisely.I often find I would like to do this if I have sets of rules to check several smaller datasets which then get combined into a larger dataset . I would like to be able to easily define checks on the smaller datasets and then combine them all and run them on the larger datasets using the
+
operator.Here's a quick example on how this could work.
Threw this together in about 10 mins, so I am ofc happy to go back and forth and make iterations until it's something you are happy with, if you think this is a good direction to go in.
Best, Ben