DavZim / dataverifyr

A Lightweight, Flexible, and Fast Data Validation Package that Can Handle All Sizes of Data
https://davzim.github.io/dataverifyr/

Count duplicated values #11

Closed: Joe-Heffer-Shef closed this issue 10 months ago

Joe-Heffer-Shef commented 10 months ago

Is there a way to count the number of duplicated values for a column? Or is this package just for validating individual cells in a data frame?

For example, I'd like to be able to define a rule that has a result like this:

# Rule name: data$my_column is unique
result["fail"] = sum(duplicated(data$my_column))
DavZim commented 10 months ago

At the moment, this package only allows validation of data at the row level. There is #10, which adds a describe() function that also counts unique values, but that is not a rule per se.

But you can use the !duplicated(var) rule to check if a variable has only unique values. Is this what you had in mind?

library(dataverifyr)

# one rule per column: the column must contain no duplicated values
rs <- ruleset(
  rule(!duplicated(uniq)),
  rule(!duplicated(non_uniq))
)

# example data: uniq has no duplicates, non_uniq has one
data <- data.frame(
  uniq = 1:3,
  non_uniq = c(1, 1, 2)
)

check_data(data, rs)
#> # A tibble: 2 × 10
#>   name               expr    allow_na negate tests  pass  fail warn  error time 
#>   <chr>              <chr>   <lgl>    <lgl>  <int> <int> <int> <chr> <chr> <drt>
#> 1 Rule for: uniq     !dupli… FALSE    FALSE      3     3     0 ""    ""    0.00…
#> 2 Rule for: non_uniq !dupli… FALSE    FALSE      3     2     1 ""    ""    0.00…

Created on 2023-10-24 with reprex v2.0.2
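
If you need the actual number of duplicates rather than just a pass/fail, a minimal sketch (based on the reprex above) is to read it off the fail column of the result:

res <- check_data(data, rs)
# for a !duplicated(x) rule, fail counts the rows flagged as duplicates,
# i.e. the same number as sum(duplicated(x))
res$fail[res$name == "Rule for: non_uniq"]
#> [1] 1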