Open sushmitavgopalan16 opened 4 years ago
@sushmitavgopalan16 i think you'd find assertr super helpful for doing this! i feel like it's quite a hidden gem, though, and documenting/publicizing its use would be awesome.
library(assertr)
library(dplyr)
# When everything passes, the original data is returned:
iris %>%
  assert(is.numeric, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  assert(is.factor, Species) %>%
  mutate(id = row_number()) %>%
  assert(is_uniq, id) %>%
  assert(in_set("virginica", "versicolor", "setosa"), Species) %>%
  head()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
#> 1 5.1 3.5 1.4 0.2 setosa 1
#> 2 4.9 3.0 1.4 0.2 setosa 2
#> 3 4.7 3.2 1.3 0.2 setosa 3
#> 4 4.6 3.1 1.5 0.2 setosa 4
#> 5 5.0 3.6 1.4 0.2 setosa 5
#> 6 5.4 3.9 1.7 0.4 setosa 6
# When something fails, it does so spectacularly!
iris %>%
  assert(is.character, Species)
#> Column 'Species' violates assertion 'is.character' 150 times
#> verb redux_fn predicate column index value
#> 1 assert NA is.character Species 1 setosa
#> 2 assert NA is.character Species 2 setosa
#> 3 assert NA is.character Species 3 setosa
#> 4 assert NA is.character Species 4 setosa
#> 5 assert NA is.character Species 5 setosa
#> [omitted 145 rows]
#> Error: assertr stopped execution
# You can change the "error" function, too
iris %>%
  assert(is.character, Species, error_fun = error_logical)
#> [1] FALSE
@sharlagelfand this is perfect! i was going to link to your new blog post for part 2 of this :)
There's also now the pointblank R package. I'm loving this pkg because it's super user-friendly like assertr, but it can also handle remote back-ends, generate reports, etc. It seems very analogous to the Python package Great Expectations that has been getting a lot of buzz lately. Would definitely love to see it get some PR!
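For anyone who hasn't tried it, here is a minimal sketch of what a pointblank check on iris might look like. The thresholds mirror the validate example elsewhere in this thread and are only illustrative; the choice of checks is my assumption, not a recommended rule set.

```r
library(pointblank)
library(magrittr)

# Build a validation plan for iris, then run every step with interrogate().
# The rules below are illustrative assumptions, not a recommended set.
agent <- create_agent(tbl = iris, tbl_name = "iris") %>%
  col_is_factor(columns = vars(Species)) %>%
  col_vals_gte(columns = vars(Sepal.Width), value = 0) %>%
  col_vals_lt(columns = vars(Sepal.Length), value = 50) %>%
  interrogate()

# TRUE only when every validation step passed for every row.
passed <- all_passed(agent)
passed
```

Unlike assertr, the agent collects results for all steps instead of stopping at the first failure, which is what makes the reporting side possible.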
@sushmitavgopalan16 I added a 'documentation' label for now thinking that potentially this project could create 'recipes' or further illustrate / publicize these pkgs. Let me know if you think that's appropriate or if you'd rather a tag denoting a new package / project!
I sat in last year's useR tutorial for Statistical Data Cleaning with R because I already knew Mark and Edwin. There is a lot more in
It may be worthwhile taking a look at this -- they are doing this professionally at Statistics Netherlands -- just to avoid reinventing a wheel or two.
The best place to start is our paper that was recently accepted by JSS.
Here's an example:
> library(validate)
> library(magrittr)
> iris %>% check_that(Sepal.Width >= 0, Sepal.Length < 50)
Object of class 'validation'
Call:
check_that(., Sepal.Width >= 0, Sepal.Length < 50)
Confrontations: 2
With fails : 0
Warnings : 0
Errors : 0
> iris %>% check_that(Sepal.Width >= 0, Sepal.Length < 50) %>% summary()
name items passes fails nNA error warning expression
1 V1 150 150 0 0 FALSE FALSE (Sepal.Width - 0) >= -1e-08
2 V2 150 150 0 0 FALSE FALSE Sepal.Length < 50
Or, to get data output:
> iris %>% check_that(Sepal.Width >= 0, Sepal.Length < 50) %>%
+ as.data.frame() %>%
+ head()
name value expression
1 V1 TRUE (Sepal.Width - 0) >= -1e-08
2 V1 TRUE (Sepal.Width - 0) >= -1e-08
3 V1 TRUE (Sepal.Width - 0) >= -1e-08
4 V1 TRUE (Sepal.Width - 0) >= -1e-08
5 V1 TRUE (Sepal.Width - 0) >= -1e-08
6 V1 TRUE (Sepal.Width - 0) >= -1e-08
But you can also externalize the checks to a text file or database, annotate them, and so on.
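As a rough sketch of the externalized-rules workflow: the same two rules are written to a plain-text file and read back with `validator(.file = ...)`. The temporary file and the object names are assumptions made for illustration; in practice the rules file would live next to the data or in version control.

```r
library(validate)

# Store the rules in a plain-text file (a temp file here, purely for illustration).
rules_file <- tempfile(fileext = ".R")
writeLines(c(
  "Sepal.Width >= 0",
  "Sepal.Length < 50"
), rules_file)

# Read the rules back and confront the data with them.
rules <- validator(.file = rules_file)
result <- confront(iris, rules)

# One row per rule, with pass/fail counts.
rule_summary <- summary(result)
rule_summary[, c("name", "items", "passes", "fails")]
```

Keeping the rules outside the script means the same file can be reused whenever an updated copy of the dataset arrives.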
I'd love to think through and document (if it already exists) or develop (if it doesn't) a framework to test datasets. I found the absence of this to be a pain point when I'd receive frequent updates to datasets with minor differences each time.
Perhaps something like -
Ideally, these would fail loudly and very specifically.
And also something similar to compare two datasets?
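To make the comparison idea concrete, here is a minimal base R sketch. `compare_datasets()` is a hypothetical helper, not a function from any of the packages mentioned in this thread; it only checks structure (column names, column classes, row count) and stops with a message that names every difference it finds.

```r
# Hypothetical helper: compare the structure of two data frames and fail
# loudly, listing every structural difference in the error message.
compare_datasets <- function(old, new) {
  problems <- character(0)

  removed <- setdiff(names(old), names(new))
  added <- setdiff(names(new), names(old))
  if (length(removed) > 0) {
    problems <- c(problems, paste("columns removed:", paste(removed, collapse = ", ")))
  }
  if (length(added) > 0) {
    problems <- c(problems, paste("columns added:", paste(added, collapse = ", ")))
  }

  # For columns present in both, check that the class is unchanged.
  for (col in intersect(names(old), names(new))) {
    old_class <- paste(class(old[[col]]), collapse = "/")
    new_class <- paste(class(new[[col]]), collapse = "/")
    if (!identical(old_class, new_class)) {
      problems <- c(problems, sprintf("column '%s' changed type: %s -> %s",
                                      col, old_class, new_class))
    }
  }

  if (nrow(old) != nrow(new)) {
    problems <- c(problems, sprintf("row count changed: %d -> %d", nrow(old), nrow(new)))
  }

  if (length(problems) > 0) {
    stop("datasets differ:\n", paste("-", problems, collapse = "\n"), call. = FALSE)
  }
  invisible(new)  # return the data so the check can sit inside a pipeline
}

# Simulate an "updated" delivery with a retyped column and an extra column.
updated <- iris
updated$Species <- as.character(updated$Species)
updated$id <- seq_len(nrow(updated))

# Capture the error message instead of halting, so it can be inspected.
msg <- tryCatch(compare_datasets(iris, updated), error = conditionMessage)
cat(msg, "\n")
```

Value-level differences (changed cells, shifted distributions) are out of scope for this sketch; base R's `all.equal()` is one starting point for those.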