HealthDataInsight / way_of_working-for-hdi

MIT License

Data validation in Python #3

Open andrewcboardman opened 4 months ago

Context and Problem Statement

Much of our work involves handling large datasets in Python or R and using them for statistical analyses. We would like automated tools for validating properties of these datasets.

Decision Drivers

Ease of use
Expressiveness (ability to define complex tests)
Reusability of tests across different projects with the same data type
Applicability of package to different data sources (e.g. CSV, SQL, Databricks)
Compatibility with existing R/Python workflows

Considered Options

Great Expectations (easy to use interactively and creates reusable expectation suites)
PyDeequ (works well with Databricks but is not applicable to other data sources)
dlookr R package (more compatible with R workflows, but less expressive and reusable than e.g. Great Expectations)
Writing our own custom code (more expressive for custom checks, but less reusable and harder to use)
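
To make the "custom code" option concrete, here is a minimal sketch of what a small reusable check suite might look like using only the Python standard library. Everything in it is an assumption made for illustration: the `Check` class, the `not_null`, `in_range` and `validate` helpers, the `PATIENT_SUITE` name, and the `patient_id` and `age` columns are hypothetical and do not come from any existing project or package.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Iterable

Row = dict[str, str]  # one record, as produced by csv.DictReader


@dataclass(frozen=True)
class Check:
    """A named rule applied to every row of a dataset (hypothetical helper)."""
    name: str
    predicate: Callable[[Row], bool]


def not_null(column: str) -> Check:
    """Build a check that fails when the column is missing or empty."""
    return Check(f"{column} is not null",
                 lambda row: row.get(column) not in (None, ""))


def in_range(column: str, low: float, high: float) -> Check:
    """Build a check that fails when the value is non-numeric or out of range."""
    def predicate(row: Row) -> bool:
        try:
            return low <= float(row[column]) <= high
        except (KeyError, TypeError, ValueError):
            return False
    return Check(f"{column} in [{low}, {high}]", predicate)


def validate(rows: Iterable[Row], checks: list[Check]) -> dict[str, list[int]]:
    """Return, for each check name, the indices of the rows that failed it."""
    failures: dict[str, list[int]] = {check.name: [] for check in checks}
    for index, row in enumerate(rows):
        for check in checks:
            if not check.predicate(row):
                failures[check.name].append(index)
    return failures


# A suite defined once and shared by any project with the same data type.
# The column names are made up for this example.
PATIENT_SUITE = [not_null("patient_id"), in_range("age", 0, 120)]

# Illustrative in-memory CSV; any iterable of dict-like rows would work too.
SAMPLE_CSV = "patient_id,age\nA1,34\n,51\nA3,140\n"
rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
report = validate(rows, PATIENT_SUITE)
print(report)
```

Because `validate` only needs an iterable of dict-like rows, the same suite could in principle be pointed at rows read from SQL or another source. The trade-off noted above still applies: reporting, profiling and documentation would all have to be written and maintained by us.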