NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Choose Data Contract Framework #769

Closed sf-dcp closed 5 months ago

sf-dcp commented 7 months ago

Related to #650. Creating a separate issue because 1) usage of data contracts can expand beyond data library (i.e., for non-dbt data products), and 2) to provide space for discussion.

Motivation

As a part of the data library revamp (AKA ingest, AKA extract), we would like to introduce data contracts into the data ingestion process to keep our data clean and consistent from the start. With data contracts, we can set clear rules on what the data should look like and catch issues such as column name changes before data product builds, making the actual builds faster. It will also provide transparency and enable cross-team collaboration, such as embedding business logic to enhance data quality. All in all, it should make everyone's life easier 🤞

Approach

There exist multiple frameworks for data validation. From my research, the main open source tools are Great Expectations and Soda. A smaller one is Pandera.

Thinking about our needs, I came up with the following guidelines to evaluate the tools:

Review of each framework

Pandera

Great Expectations (GX)

Soda

Summary

So far, Soda seems to best fit our needs in terms of simplicity, integration with Postgres & local files, custom data checks, and readability of data contracts. I'm leaning towards the Postgres integration for simplicity. It can be easily integrated with current builds: we would need to 1) define yaml file(s) with data checks and Postgres connection info and 2) run a CLI command. Don't love that geospatial checks may not be available.
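As a rough sketch of what that setup might look like (connection values, table, and column names below are all hypothetical placeholders, not our actual config):

```yaml
# configuration.yml -- Postgres connection (placeholder values)
data_source builds_pg:
  type: postgres
  host: localhost
  port: 5432
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: builds
  schema: public

# checks.yml -- SodaCL data checks (hypothetical table/columns)
checks for pluto:
  - row_count > 0
  - missing_count(bbl) = 0
  - invalid_count(borough) = 0:
      valid values: [MN, BX, BK, QN, SI]
  - schema:
      fail:
        when required column missing: [bbl, geom]
```

The CLI command would then be something like `soda scan -d builds_pg -c configuration.yml checks.yml`.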

Next steps

Implement the same data checks via Soda & Great Expectations and compare the implementations:

Edit: the write-up was revised/enhanced after the PRs above.

sf-dcp commented 7 months ago

I will tag the team here once I implement data checks via Soda & Great Expectations to demonstrate differences between the 2 tools. Feel free to add questions/immediate thoughts in the meantime

damonmcc commented 7 months ago

so thorough! 👏🏾 Soda seems like a winner to me

having to read local files into DuckDB seems like a fun thing for us to be forced to lol

since DuckDB has a geospatial extension, maybe that could unlock geospatial checks someday, and maybe even let us check FileGDBs. looks like this open issue in the soda-core repo is related: https://github.com/sodadata/soda-core/issues/1964

when we really need them, I think we can already use dbt for geospatial checks via custom tests

alexrichey commented 7 months ago

I'm actually quite intrigued by Pandera. Since we use Pydantic already, we could potentially re-use some model code. But it also looks flexible enough to hook up to our existing metadata. We'd possibly just have to write a little glue code to parse our product metadata and translate it into a DataFrameSchema, which is just a better version of what I'm already doing for package validation. And if we need to do geo-specific checks, we could write those in dcpy.

I suppose for me the big question is whether we want to do validation in (geo)dataframes or in a database. For my use case, dataframes are preferable. Looks like Pandera integrates nicely with geopandas as well.

Thoughts?

damonmcc commented 7 months ago

@alexrichey on validating (geo)dataframes vs database tables:

Since we build and export from DB tables, it seems like validating tables is better than converting to and validating (geo)dataframes.

And the ability to validate files (source and packaged) seems like a significant feature we want. I guess we can always load a FileGDB we've generated into a set of geodataframes, and maybe the only other alternative to validate it (via Soda) is to use DuckDB like this.

alexrichey commented 7 months ago

Since we build and export from DB tables, it seems like validating tables is better than converting to and validating (geo)dataframes.

Well, I need to validate post export from a database. And for Ingest code, it's validating pre-import to a database, right? Then for actual database stuff, we've got dbt.

For my use-case, the database is just another dependency to account for. It would make a lot of sense if our data was too large to store (and operate on) in memory, though. Which maybe is the case with something like PLUTO?

alexrichey commented 7 months ago

Would like to hear a little more about requirements on the ingest side @sf-dcp and @fvankrieken

damonmcc commented 7 months ago

Well, I need to validate post export from a database.

Totally, and those are files. So I imagine this is our ranking of preferred data formats to validate:

  1. database tables (builds)
  2. files (source data, build exports)
  3. geodataframes (conversions of 1 and 2)
sf-dcp commented 7 months ago

@alexrichey, for your case with distributed datasets, how do you envision the integration with Pandera? Would you define data checks in a yaml file, translate them to Pandera/pydantic classes, and validate?

I think our ideal framework is one that's able to work with both databases and files with minimal setup and future maintenance. Also a huge bonus if it's readable enough that we can collaborate on data checks with GIS or other teams.

sf-dcp commented 7 months ago

Side note... my personal preference is working with anything but pandas dataframes because of the funkiness with data types.

When you load local data straight into geopandas, it's fine: like Postgres or gdal, it doesn't change data types. Regular dfs, on the other hand, do change data types. For example, if you have an integer column with nulls (say bbls), pandas converts the values to decimals and replaces the nulls with NaN values. And this behavior persists when you convert a pandas df to a geopandas df (the case with csv files that contain geospatial data).
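A minimal illustration of that coercion (the column values here are made up):

```python
import io
import pandas as pd

raw = "bbl,value\n1000010001,1\n,2\n"  # integer bbl column with a null

# Default inference: the null forces the column to float64, so bbls become decimals
df = pd.read_csv(io.StringIO(raw))
print(df["bbl"].dtype)    # float64
print(df["bbl"].iloc[0])  # 1000010001.0

# pandas' nullable Int64 dtype (or reading the column as str) avoids the coercion
df2 = pd.read_csv(io.StringIO(raw), dtype={"bbl": "Int64"})
print(df2["bbl"].dtype)   # Int64
```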

alexrichey commented 7 months ago

Would you define data checks in a yaml file, translate them to Pandera/pydantic classes, and validate?

Yes, exactly. So for example, for COLP I was thinking we'd just parse metadata, and implement custom checks for things like BBLs, or wkbs, etc. I think it'd just take a little glue code. But... seems like Pandera would most easily facilitate us writing our declarative checks in the format of our choice. At a glance, it seems like it's the most lightweight and hackable.

And I feel you with pandas dataframes converting. I've certainly felt that pain, but it mostly goes away when you read in everything as a string. I suppose I'd have concerns in the opposite direction, with potential type coercion happening when importing into a database.

Maybe it makes sense for me to quickly POC what I've described?

sf-dcp commented 7 months ago

If it's a quick POC to do, then yeah, it would be helpful to see!

sf-dcp commented 7 months ago

I'm not sure Pandera is the right tool for dq checks during product builds. It would work for output files, but not for intermediate tables...

sf-dcp commented 7 months ago

Update: geospatial queries work in Soda with Postgres! @damonmcc figured it out :)

(screenshot of the Soda geospatial check omitted)

I revised the PR with code as seen above.

Next step for me is to explore DuckDB with Soda for local files.

sf-dcp commented 6 months ago

During our discussions, a few more critical factors were brought up that should be a part of the decision matrix in addition to the original criteria. Below please find the aggregated criteria with their weights. I filled out the table to conduct a quantitative comparison of Pandera & Soda.

Note, a tool is measured on a scale of 1-5 and then weighted per criterion. For example, the "Easy-to-read checks" total for Pandera is 5 (rating) x 2 (criteria weight) = 10.

Open to discussion on the ratings. Hopefully, this will provide us with a framework to make the final, well-rounded decision.
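The scoring formula, sketched in code (the ratings and criteria below are an illustrative subset, not the actual table values):

```python
# Illustrative ratings (1-5 scale) and criteria weights -- not the full table
weights = {"Documentation quality": 1, "Easy-to-read checks": 2}
ratings = {
    "Soda":    {"Documentation quality": 4, "Easy-to-read checks": 4},
    "Pandera": {"Documentation quality": 5, "Easy-to-read checks": 5},
}

def weighted_total(tool_ratings: dict, weights: dict) -> int:
    # e.g. a rating of 5 with weight 2 contributes 5 * 2 = 10
    return sum(tool_ratings[c] * weights[c] for c in weights)

max_total = 5 * sum(weights.values())  # best possible score
for tool in ratings:
    t = weighted_total(ratings[tool], weights)
    print(f"{tool}: {t}/{max_total} = {t / max_total:.0%}")
```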

| Criteria | Criteria Weight | Soda | Pandera | Notes |
|---|---|---|---|---|
| Documentation Quality | 1 | 4 | 5 | |
| Easy-to-read data checks (cross team collaboration) | 2 | 8 | 10 | For Pandera, the assumption behind the rating is defining our own yaml format to make it readable. |
| Enables one file for all metadata (sources, products) | 2 | 10 | 10 | For Soda, the assumption behind the rating is defining data checks within our custom yaml file and executing Soda via Python. To use the CLI for Soda, a Soda checks file needs to be a standalone file. |
| Integrates with our data file formats (ingest, packaging) | 2 | 10 | 10 | |
| Integrates with Postgres (product builds) | 2 | 10 | 2 | For Pandera, there is no option to connect to a database directly. |
| Simple to add new data checks | 2 | 10 | 10 | |
| Simple to write custom data checks | 1 | 5 | 3 | |
| Total (raw score) | 70 | 57 | 50 | |
| Total (%) | 100% | 81% | 71% | |
alexrichey commented 6 months ago

@sf-dcp Would you be able to list some of your use-cases, in particular the common data checks you'd be performing, and maybe some of the more complicated cases? (especially those where Pandera might be too lightweight)

That might help us clarify useful categories and their respective weights.

Edit: Chatted with @sf-dcp and we scoped this down to just the lifecycle steps before and after the build, i.e. Ingest and Package/Distribute. Although there is the lingering question of whether we'd want to potentially use SODA to QA builds, which would change the weights pretty significantly.

fvankrieken commented 5 months ago
| Criteria | Criteria Weight | Soda | Pandera | Notes |
|---|---|---|---|---|
| Documentation Quality | 0 | 0 | 0 | They're both fine |
| Easy-to-write data checks | 2 | 3 | 4 | Breaking this into two. Pandera's ranking here clearly makes some assumptions about what we've set up (edit: knocked it down slightly for this reason). Soda takes knocks because it is effectively its own language: in the examples here, the keywords around invalid/valid, or all of the grammar around schema, are not intuitive. This is where I'd much rather have check names and kwargs and leave it at that. |
| Easy-to-read data checks (cross team collaboration) | 2 | 5 | 4 | There are a couple of things that are more awkward in Pandera (`row_count > n` or the like is very intuitive to read, if less so to write). |
| Enables one file for all metadata (sources, products) | 2 | 4 | 5 | Docking Soda a bit for not quite working out of the box for this, and I like the flexibility we have with Pandera. |
| Integrates with our data file formats (ingest, packaging) | 2 | 3 | 5 | It seems like Pandera slots very cleanly into packaging based on the tests Alex was writing. Doesn't mean that Soda couldn't, but working around pandas with Soda just seems a tad more cumbersome. Could maybe lower the weight of this row, though. |
| Integrates with Postgres (product builds) | 1 | 5 | 1 | For me the weight is lower because I feel that if we're moving towards dbt in our builds, it fits that portion nicely. Thinking now, this is something we should talk about a bit more, because dbt has remained largely out of our data validation talks. The counterpoint is that there's the barrier of actually setting up dbt for each of our products, which on its own is more overhead than Soda. So the weight here is an open question. |
| Simple to write custom data checks | 3 | 2 | 5 | This depends on what we're talking about. A one-off custom column check? Writing a query in Soda is very easy. Validating that a bbl is in the proper format, and re-using that custom logic across tests? I would much rather do that in python. It doesn't seem like these valid formats are extendable, sadly. Unless I'm missing something, I don't see custom reusable checks in Soda. |
| Simple to add new data checks | 0 | 0 | 0 | Both are fine |
| Cost of maintaining code around structure of checks, custom checks | 1 | 5 | 2 | This is something that I want to believe won't be a big issue if we used Pandera, but I have to admit there is some cost associated here. An argument could be made to weight this higher. |
| Total (raw score) | 65 | 46 | 51 | |
| Total (%) | | 70.7% | 78.5% | |

For the couple I weighted zero, it's more that they won't differentiate the tools for me, not that they're unimportant!

Going through this made a couple of things clear to me

alexrichey commented 5 months ago

I think this is a really great rubric. For me there are a few open questions that would affect the score on SODA. Let's discuss!

| Criteria | Criteria Weight | Soda | Pandera | Notes |
|---|---|---|---|---|
| Documentation Quality | 1 | 3 | 7 | The draw for me here is that Pandera has about two paragraphs of relevant documentation. It's intuitive enough that I don't need documentation. |
| Easy-to-read data checks (cross team collaboration) | 2 | 8 | 10 | For Pandera, the assumption behind the rating is defining our own yaml format to make it readable. (unchanged from Sasha's) |
| Enables one file for all metadata (sources, products) | 2 | Q1? | 10 | This is the open question for me. It wasn't clear how we'd do this with SODA, though I suppose we could always generate SODA yaml files off our product metadata. |
| Integrates with our data file formats (ingest, packaging) | 1 | 10 | 10 | |
| Integrates with Postgres (product builds) | Q2? | 10 | 0 | For Pandera, there is no option to connect to a database directly. With a little bit of glue code, we could connect Pandera to any data source, but that's not a great option in certain cases (e.g. big data sets). |
| Simple to add new data checks | 2 | 10 | 10 | |
| Simple to write custom data checks | 1 | 5 | 10 | |
| Total (raw score) | 70 | 36 + 2x(Q1?) + (Q2?)x10 = range of 36-76 | 57 | |
alexrichey commented 5 months ago

these really are different tools. Soda is a lightweight, out-of-the-box tool to declare and run data checks. Pandera is a framework to set up your own system of running data checks

@fvankrieken My thinking is that Pandera is more akin to a library - you can adapt it however you'd like, but there's a definite cost to the open-endedness. Whereas SODA is very much a framework (somewhat akin to DBT, or, in another sphere entirely, Ruby On rails), with expected ways to do things right out of the box, and that comes with its own pros and cons.

damonmcc commented 5 months ago
| Criteria | Criteria Weight | Soda | Pandera | Notes |
|---|---|---|---|---|
| Documentation Quality | 1 | 4 | 5 | |
| Easy-to-read data checks (cross team collaboration) | 2 | 9 | 4 | |
| Enables one file for all metadata (sources, products) | 2 | 8 | 8 | |
| Integrates with our data file formats (ingest, packaging) | 2 | 8 | 6 | |
| Integrates with Postgres (product builds) | 2 | 10 | 2 | |
| Simple to add new data checks | 2 | 8 | 6 | |
| Simple to write custom data checks | 1 | 6 | 8 | |
| Total (raw score) | 70 | 53 | 39 | |
| Total (%) | 100% | XX% | XX% | |
sf-dcp commented 5 months ago

Per in-person discussion, Pandera wins for the ingest/packaging cases. Aside from the "Integrates with Postgres (product builds)" criterion, Pandera is the winner primarily for the readability of its checks and its integration with existing yaml files.

Regarding the ETL data checks, we need more info to decide between Soda and dbt. If we don't seek validation from external teams during the ETL process, we are leaning towards dbt.

todo (not a part of this issue):