DavZim / dataverifyr

A Lightweight, Flexible, and Fast Data Validation Package that Can Handle All Sizes of Data
https://davzim.github.io/dataverifyr/
Other
26 stars 1 forks source link

Feature request: Add `+` for rulesets #9

Closed beniaminogreen closed 11 months ago

beniaminogreen commented 11 months ago

Hi there, thanks for writing such a fantastic package!

I'm writing to ask if you can implement the + operator to combine several rulesets. This would allow users to combine several rulesets programatically and concisely.

I often find I would like to do this if I have sets of rules to check several smaller datasets which then get combined into a larger dataset . I would like to be able to easily define checks on the smaller datasets and then combine them all and run them on the larger datasets using the + operator.

Here's a quick example on how this could work.

ruleset_a <- ruleset(
  rule(mpg > 10 & mpg < 30) # mpg goes up to 34
)
rulesset_b <- ruleset(
  rule(cyl %in% c(4, 8)), # missing 6 cyl
  rule(vs %in% c(0, 1), allow_na = TRUE)
)

ruleset_a+ruleset_b 
#> <Verification Ruleset with 3 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#>   [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)

Threw this together in about 10 mins, so I am ofc happy to go back and forth and make iterations until it's something you are happy with, if you think this is a good direction to go in.

Best, Ben

DavZim commented 11 months ago

Hi Ben,

thanks for the PR (and on a related note congrats to the sucess of zoomerjoin <3).

I like the idea of having an option to combine rulesets. I am at the moment unsure whether I would want to overload the + operator or have a function like data.table::rbindlist() (maybe bindlist(list(ruleset1, ruleset2, ruleset3))). Maybe both would be possible.

Also, I would like to have a couple of tests that cover some use-cases. And last but not least, feel free to add yourself to the DESCRIPTION file.

Best, David

beniaminogreen commented 11 months ago

I very much like the suggestion to add a bind_rules or something similar to allow users to programatically combine a list of rulsets and rules. I've gone ahead and implemented this. Under the hood, I've done this by improving the overloading of + so that you can now also add:

Here's a snippet showing off this new behavior.

rule_1 <- rule(mpg > 10 & mpg < 30) # mpg goes up to 34
rule_2 <- rule(cyl %in% c(4, 8)) # missing 6 cyl
rule_3 <- rule(vs %in% c(0, 1), allow_na = TRUE)

# combine rules with rules
rule_1 + rule_2 
#> <Verification Ruleset with 2 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)

# combine rulesets with rules
ruleset(rule_1, rule_2) + rule_3
#> <Verification Ruleset with 3 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#>   [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)

# combine rules with rulesets
rule_1 + ruleset(rule_2, rule_3) 
#> <Verification Ruleset with 3 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#>   [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)

# rulesets with rulesets
ruleset(rule_1) + ruleset(rule_2, rule_3) 
#> <Verification Ruleset with 3 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#>   [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)

# combine a list of rulsets and rules with `bind_rules`
bind_rules(list(rule_1, rule_2, ruleset(rule_3)))
#> <Verification Ruleset with 3 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#>   [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)

Created on 2023-10-16 with reprex v2.0.2

If you think this is a promising direction, I would love to spruce up the documentation, and can add some examples. I've also gone ahead and added some tests to verify that the + method and bind_rules are working as expected.

DavZim commented 11 months ago

I have added two comments to your code. I like where this is going, the only thing that I think is missing are duplications in rules. How do we want to handle them? Drop duplications or do we want to keep them? What do you think makes more sense?

beniaminogreen commented 11 months ago

Just made both of those changes - thanks for taking a look at them.

How to handle duplicates is a good question. I think that keeping or removing duplicates makes sense as long as the documentation is very explicit about it.

I have just made a change to remove duplicates when adding two sets together. This makes the behavior consistent with what a user might expect when taking the union of two sets - duplicates will automatically be removed, because sets don't contain duplicates.

DavZim commented 11 months ago

Looks good to me. I'll run the tests and then merge the PR.

DavZim commented 11 months ago

There is one error in the automated tests https://github.com/DavZim/dataverifyr/actions/runs/6547484442/job/17788344146#step:6:137

beniaminogreen commented 11 months ago

Thanks for flagging this. I think the github runner is set to be more strict than the one on my PC. Just fixed the warning and the note it raised.

Edit: I think the issue was I was using devtools::test() not devtools::check() because I did not have duckdb installed on the machine I am using.

beniaminogreen commented 11 months ago

Sorry, realized there was another issue. Just tested with devtools::check_man() and devtools::check() and it all looks good now.