Framework to 'test' datasets

sushmitavgopalan16 commented 4 years ago

I'd love to think through and document (if it already exists) or develop (if it doesn't) a framework to test datasets. I found the absence of this to be a pain point when I'd receive frequent updates to datasets with minor differences each time.

Perhaps something like:

df %>%
    expect_strings(var1, var2, var5) %>%
    expect_factors(var10, var3) %>%
    expect_unique_values(var7) %>%
    expect_values_in(var8, c('red', 'yellow'))

Ideally, these would fail loudly and very specifically.
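For illustration, here is a minimal base-R sketch of how helpers like these might be written. The function names mirror the pipeline above and are hypothetical; they are not taken from any existing package, and `check_cols` is an invented internal helper. It assumes R 4.1 or later for the native pipe.

```r
# Hypothetical helpers: names mirror the proposed pipeline, not a real package.

# Internal helper (invented for this sketch): stop with a specific message
# if any named column is missing or fails the type predicate.
check_cols <- function(df, cols, ok, what) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", toString(missing_cols), call. = FALSE)
  }
  bad <- cols[!vapply(df[cols], ok, logical(1))]
  if (length(bad) > 0) {
    stop("Column(s) not of type ", what, ": ", toString(bad), call. = FALSE)
  }
  df  # return the data unchanged so calls can be piped
}

expect_strings <- function(df, ...) {
  # Capture bare column names such as var1, var2 as character strings
  check_cols(df, as.character(substitute(list(...)))[-1], is.character, "character")
}

expect_factors <- function(df, ...) {
  check_cols(df, as.character(substitute(list(...)))[-1], is.factor, "factor")
}

expect_unique_values <- function(df, col) {
  col <- as.character(substitute(col))
  dup <- unique(df[[col]][duplicated(df[[col]])])
  if (length(dup) > 0) {
    stop("Duplicated values in ", col, ": ", toString(head(dup, 5)), call. = FALSE)
  }
  df
}

expect_values_in <- function(df, col, allowed) {
  col <- as.character(substitute(col))
  unexpected <- setdiff(unique(as.character(df[[col]])), allowed)
  if (length(unexpected) > 0) {
    stop("Unexpected values in ", col, ": ", toString(unexpected), call. = FALSE)
  }
  df
}

# Toy data that satisfies every expectation
toy <- data.frame(
  id = 1:3,
  colour = c("red", "yellow", "red"),
  stringsAsFactors = FALSE
)

checked <- toy |>
  expect_strings(colour) |>
  expect_unique_values(id) |>
  expect_values_in(colour, c("red", "yellow"))
```

Each helper returns the data untouched on success and stops with a message naming the offending column or values on failure, which is roughly the "loud and specific" behaviour described above.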

And also something similar to compare two datasets?
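As a rough idea of what a comparison could report, here is a small base-R sketch. `compare_datasets` is a hypothetical name invented for this example; it only looks at column names, column classes, and row counts, so it is a starting point rather than a full diff.

```r
# Hypothetical helper: summarise structural differences between two versions
# of a dataset (columns dropped or added, class changes, change in row count).
compare_datasets <- function(old, new) {
  shared <- intersect(names(old), names(new))
  class_changed <- vapply(
    shared,
    function(nm) !identical(class(old[[nm]]), class(new[[nm]])),
    logical(1)
  )
  list(
    dropped_cols = setdiff(names(old), names(new)),
    added_cols   = setdiff(names(new), names(old)),
    type_changes = shared[class_changed],
    row_change   = nrow(new) - nrow(old)
  )
}

# Simulate an "updated" delivery of iris with small differences
old <- iris
new <- iris[1:140, ]
new$Species <- as.character(new$Species)  # factor silently became character
new$Petal.Width <- NULL                   # a column went missing

diffs <- compare_datasets(old, new)
str(diffs)
```

For value-level differences, base R's `all.equal(old, new)` gives a quick first look once the structure matches.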

sharlagelfand commented 4 years ago

@sushmitavgopalan16 i think you'd find assertr super helpful for doing this! i feel like it's quite a hidden gem, though, and documenting/publicizing its use would be awesome.

library(assertr)
library(dplyr)

# When everything passes, the original data is returned:
iris %>%
  assert(is.numeric, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  assert(is.factor, Species) %>%
  mutate(id = row_number()) %>%
  assert(is_uniq, id) %>%
  assert(in_set("virginica", "versicolor", "setosa"), Species) %>%
  head()
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
#> 1          5.1         3.5          1.4         0.2  setosa  1
#> 2          4.9         3.0          1.4         0.2  setosa  2
#> 3          4.7         3.2          1.3         0.2  setosa  3
#> 4          4.6         3.1          1.5         0.2  setosa  4
#> 5          5.0         3.6          1.4         0.2  setosa  5
#> 6          5.4         3.9          1.7         0.4  setosa  6

# When something fails, it does so spectacularly!
iris %>%
  assert(is.character, Species)
#> Column 'Species' violates assertion 'is.character' 150 times
#>     verb redux_fn    predicate  column index  value
#> 1 assert       NA is.character Species     1 setosa
#> 2 assert       NA is.character Species     2 setosa
#> 3 assert       NA is.character Species     3 setosa
#> 4 assert       NA is.character Species     4 setosa
#> 5 assert       NA is.character Species     5 setosa
#>   [omitted 145 rows]
#> Error: assertr stopped execution

# You can change the "error" function, too
iris %>%
  assert(is.character, Species, error_fun = error_logical)
#> [1] FALSE
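A few other assertr verbs are worth a look. This is a sketch along the lines of the package README, using the built-in `mtcars` data: `verify()` checks an expression against the whole data frame, `insist()` derives its bounds from the data itself, and `chain_start()` / `chain_end()` collect all failures and report them together instead of stopping at the first one.

```r
library(assertr)
library(dplyr)

checked <- mtcars %>%
  chain_start() %>%                       # defer errors until chain_end()
  verify(nrow(.) > 10) %>%                # enough rows were delivered
  verify(mpg > 0) %>%                     # logical expression on a column
  assert(in_set(0, 1), am, vs) %>%        # only 0/1 allowed in these columns
  insist(within_n_sds(4), mpg) %>%        # no value beyond 4 SDs of the mean
  chain_end()                             # report every failure at once

head(checked)
```

Since every check passes here, the data comes back out of the pipeline and can be used as usual.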

sushmitavgopalan16 commented 4 years ago

@sharlagelfand this is perfect! i was going to link to your new blog post for part 2 of this :)

emilyriederer commented 4 years ago

There's also now the pointblank R package. I'm loving this pkg because it's super user-friendly like assertr but it can also handle remote back-ends, generate reporting, etc. It seems very analogous to the Python package Great Expectations that has been getting a lot of buzz lately. Would definitely love to see it get some PR!
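A minimal sketch of the agent-based workflow, assuming a local data frame (`iris`) rather than a remote table, and using the `vars()` column syntax, which newer releases may steer away from in favour of bare column names:

```r
library(pointblank)

# Build an agent, attach validation steps, then run them with interrogate()
agent <- create_agent(tbl = iris) |>
  col_is_numeric(columns = vars(Sepal.Length, Sepal.Width)) |>
  col_is_factor(columns = vars(Species)) |>
  col_vals_in_set(
    columns = vars(Species),
    set = c("setosa", "versicolor", "virginica")
  ) |>
  col_vals_gt(columns = vars(Sepal.Width), value = 0) |>
  interrogate()

# TRUE only if every validation step passed for every row
all_passed(agent)
```

Printing `agent` in an interactive session shows the validation report.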

emilyriederer commented 4 years ago

@sushmitavgopalan16 I added a 'documentation' label for now thinking that potentially this project could create 'recipes' or further illustrate / publicize these pkgs. Let me know if you think that's appropriate or if you'd rather a tag denoting a new package / project!

eddelbuettel commented 4 years ago

I sat in on last year's useR tutorial for Statistical Data Cleaning with R because I already knew Mark and Edwin. There is a lot more in

an entire org at GitHub by them
stuff for their 2017 Wiley book on the topic

It may be worthwhile taking a look at this -- they are doing this professionally at Statistics Netherlands -- just to avoid reinventing a wheel or two.

markvanderloo commented 4 years ago

The best place to start is our paper that was recently accepted by JSS.

Here's an example:

> library(validate)
> library(magrittr)
> iris %>% check_that(Sepal.Width >= 0, Sepal.Length < 50)
Object of class 'validation'
Call:
    check_that(., Sepal.Width >= 0, Sepal.Length < 50)

Confrontations: 2
With fails    : 0
Warnings      : 0
Errors        : 0
> iris %>% check_that(Sepal.Width >= 0, Sepal.Length < 50) %>% summary()
  name items passes fails nNA error warning                  expression
1   V1   150    150     0   0 FALSE   FALSE (Sepal.Width - 0) >= -1e-08
2   V2   150    150     0   0 FALSE   FALSE           Sepal.Length < 50

Or, to get data output:

> iris %>% check_that(Sepal.Width >= 0, Sepal.Length < 50) %>% 
+    as.data.frame() %>%
+    head()
  name value                  expression
1   V1  TRUE (Sepal.Width - 0) >= -1e-08
2   V1  TRUE (Sepal.Width - 0) >= -1e-08
3   V1  TRUE (Sepal.Width - 0) >= -1e-08
4   V1  TRUE (Sepal.Width - 0) >= -1e-08
5   V1  TRUE (Sepal.Width - 0) >= -1e-08
6   V1  TRUE (Sepal.Width - 0) >= -1e-08

But you can also externalize the checks to a text file or database, annotate them, and so on.
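As a small sketch of the text-file route: the rules below are written to a temporary file purely for the example (in practice the file would live alongside the project), then read back with `validator(.file = )` and applied with `confront()`. The third rule is an extra one added for illustration.

```r
library(validate)

# Write the rules to a plain text file, one R expression per rule.
# A temp file is used here only to keep the example self-contained.
rules_file <- tempfile(fileext = ".R")
writeLines(c(
  "# Rules kept outside the analysis script",
  "Sepal.Width >= 0",
  "Sepal.Length < 50",
  "Species %in% c('setosa', 'versicolor', 'virginica')"
), rules_file)

rules <- validator(.file = rules_file)  # read the rule set from the file
out <- confront(iris, rules)            # apply the rules to the data
res <- summary(out)                     # one row per rule, with pass/fail counts
res
```

Keeping the rules in their own file means the same checks can be rerun against each new delivery of the data without touching the analysis code.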
