Context and Problem Statement
Much of our work involves handling large datasets in Python or R and using them for statistical analyses. We would like automated tools for validating properties of these datasets.
Decision Drivers
Ease of use
Expressiveness (ability to define complex tests)
Reusability of tests across different projects with the same data type
Applicability of the package to different data sources (e.g. CSV, SQL, Databricks)
Compatibility with existing R/Python workflows
Considered Options
Great Expectations (easy to use interactively and creates reusable expectation suites)
PyDeequ (works well with Databricks but not applicable to other data sources)
dlookr R package (more compatible with R workflows; less expressive and less reusable than, e.g., Great Expectations)
Writing our own custom code (more expressive for custom checks but less reusable and harder to use)
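To make the custom-code option more concrete, the following is a minimal sketch of what a small, reusable set of checks might look like in plain Python. It uses only the standard library. Every name in it (`not_null`, `in_range`, `validate`, `PATIENT_SUITE`, and the `patient_id` and `age` columns) is hypothetical and chosen for illustration; it is not taken from any of the packages listed above.

```python
import csv
import io
from typing import Callable, Iterable

Row = dict[str, str]
# A check takes all rows and returns a list of failure messages (empty if it passes).
Check = Callable[[list[Row]], list[str]]


def not_null(column: str) -> Check:
    """Build a check that reports every row where `column` is empty or absent."""
    def check(rows: list[Row]) -> list[str]:
        return [
            f"row {i}: '{column}' is missing"
            for i, row in enumerate(rows, start=1)
            if not (row.get(column) or "").strip()
        ]
    return check


def in_range(column: str, low: float, high: float) -> Check:
    """Build a check that reports non-numeric or out-of-range values in `column`."""
    def check(rows: list[Row]) -> list[str]:
        failures = []
        for i, row in enumerate(rows, start=1):
            raw = (row.get(column) or "").strip()
            if not raw:
                continue  # empty values are left to not_null
            try:
                value = float(raw)
            except ValueError:
                failures.append(f"row {i}: '{column}' is not numeric ({raw!r})")
                continue
            if not low <= value <= high:
                failures.append(f"row {i}: '{column}'={value} outside [{low}, {high}]")
        return failures
    return check


def validate(rows: Iterable[Row], checks: Iterable[Check]) -> list[str]:
    """Run every check over the rows and collect all failure messages."""
    materialised = list(rows)
    return [msg for check in checks for msg in check(materialised)]


# Hypothetical suite that could be shared by projects using the same data type.
PATIENT_SUITE = [not_null("patient_id"), in_range("age", 0, 120)]

# Illustrative in-memory CSV; a real project would read from a file or database.
sample_csv = "patient_id,age\nA1,34\n,51\nA3,150\n"
failures = validate(csv.DictReader(io.StringIO(sample_csv)), PATIENT_SUITE)
for message in failures:
    print(message)
```

Even in this small form, the trade-off noted above is visible: arbitrary checks are easy to express, but reuse across data sources and projects depends on conventions we would have to design and maintain ourselves.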