NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Choose Data Contract Framework #769

Closed sf-dcp closed 5 months ago

sf-dcp commented 7 months ago

Related to #650. Creating a separate issue because 1) usage of data contracts can expand beyond data library (i.e., for non-dbt data products), and 2) to provide space for discussion.

Motivation

As a part of the data library revamp (AKA ingest, AKA extract), we would like to introduce data contracts into the data ingestion process to keep our data clean and consistent from the start. With data contracts, we can set clear rules on what the data should look like and catch issues such as column name changes before data product builds, making the actual builds faster. It will also provide transparency and enable cross-team collaboration, such as embedding business logic to enhance data quality. All in all, it should make everyone's life easier 🤞

Approach

There exist multiple frameworks for data validation. From my research, the main open source tools are Great Expectations and Soda. A smaller one is Pandera.

Thinking about our needs, I came up with the following guidelines to evaluate the tools:

Review of each framework

Pandera

Great Expectations (GX)

Soda

Summary

So far, Soda seems to best fit our needs in terms of simplicity, integration with Postgres & local files, custom data checks, and readability of data contracts. I'm leaning towards the Postgres integration for simplicity. It can be easily integrated with current builds: we would need to 1) define yaml file(s) with data checks and Postgres connection info and 2) run a CLI command. Don't love that geospatial checks may not be available.
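As a rough sketch of what that setup might look like (connection values, table, and column names below are all hypothetical placeholders, not our actual config):

```yaml
# configuration.yml -- Postgres connection (placeholder values)
data_source builds_pg:
  type: postgres
  host: localhost
  port: 5432
  username: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}
  database: builds
  schema: public

# checks.yml -- SodaCL data checks (hypothetical table/columns)
checks for pluto:
  - row_count > 0
  - missing_count(bbl) = 0
  - invalid_count(borough) = 0:
      valid values: [MN, BX, BK, QN, SI]
  - schema:
      fail:
        when required column missing: [bbl, geom]
```

The CLI command would then be something like `soda scan -d builds_pg -c configuration.yml checks.yml`.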

Next steps

Implement the same data checks via Soda & Great Expectations and compare the implementations:

Edit: the write-up was revised/enhanced after the PRs above.

sf-dcp commented 7 months ago

I will tag the team here once I implement data checks via Soda & Great Expectations to demonstrate differences between the 2 tools. Feel free to add questions/immediate thoughts in the meantime

damonmcc commented 7 months ago

so thorough! 👏🏾 Soda seems like a winner to me

having to read local files into DuckDB seems like a fun thing for us to be forced to lol

since DuckDB has a geospatial extension, maybe that could unlock geospatial checks someday, and maybe even let us check FileGDBs. looks like this open issue in the soda-core repo is related: https://github.com/sodadata/soda-core/issues/1964

when we really need them, I think we can already use dbt for geospatial checks via custom tests

alexrichey commented 7 months ago

I'm actually quite intrigued by Pandera. Since we use Pydantic already, we could potentially re-use some model code. But it also looks flexible enough to hook up to our existing metadata. We'd possibly just have to write a little glue code to parse our product metadata and translate it into a DataFrameSchema, which is just a better version of what I'm already doing for package validation. And if we need to do geo-specific checks, we could write those in dcpy.

I suppose for me the big question is whether we want to do validation in (geo)dataframes or in a database. For my use case, dataframes are preferable. Looks like Pandera integrates nicely with geopandas as well.

Thoughts?

damonmcc commented 7 months ago

@alexrichey on validating (geo)dataframes vs database tables:

Since we build and export from DB tables, it seems like validating tables is better than converting to and validating (geo)dataframes.

And the ability to validate files (source and packaged) seems like a significant feature we want. I guess we can always load a FileGDB we've generated into a set of geodataframes, and maybe the only other alternative to validate it (via Soda) is to use DuckDB like this.

alexrichey commented 7 months ago

Since we build and export from DB tables, it seems like validating tables is better than converting to and validating (geo)dataframes.

Well, I need to validate post export from a database. And for Ingest code, it's validating pre-import to a database, right? Then for actual database stuff, we've got dbt.

For my use-case, the database is just another dependency to account for. It would make a lot of sense if our data was too large to store (and operate on) in memory, though. Which maybe is the case with something like PLUTO?

alexrichey commented 7 months ago

Would like to hear a little more about requirements on the ingest side @sf-dcp and @fvankrieken

damonmcc commented 7 months ago

Well, I need to validate post export from a database.

Totally, and those are files. So I imagine this is our ranking of preferred data formats to validate:

  1. database tables (builds)
  2. files (source data, build exports)
  3. geodataframes (conversions of 1 and 2)
sf-dcp commented 7 months ago

@alexrichey, for your case with distributed datasets, how do you envision the integration with Pandera? Would you define data checks in a yaml file, translate them to Pandera/pydantic classes, and validate?

I think our ideal framework is one that's able to work with both databases and files with minimal setup and future maintenance. Also a huge bonus if it's readable enough that we can collaborate on data checks with GIS or other teams.

sf-dcp commented 7 months ago

Side note... my personal preference is working with anything but pandas dataframes because of the funkiness with data types.

When you load local data straight into geopandas, it's fine: like Postgres or gdal, it doesn't change data types. Regular dfs, on the other hand, do change data types. For example, if you have an integer column with nulls (say bbls), pandas converts the values to decimals and replaces the nulls with NaN values. And this behavior persists when you convert a pandas df to a geopandas df (the case with csv files that contain geospatial data).
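A minimal illustration of that coercion (the column values here are made up):

```python
import io
import pandas as pd

raw = "bbl,value\n1000010001,1\n,2\n"  # integer bbl column with a null

# Default inference: the null forces the column to float64, so bbls become decimals
df = pd.read_csv(io.StringIO(raw))
print(df["bbl"].dtype)    # float64
print(df["bbl"].iloc[0])  # 1000010001.0

# pandas' nullable Int64 dtype (or reading the column as str) avoids the coercion
df2 = pd.read_csv(io.StringIO(raw), dtype={"bbl": "Int64"})
print(df2["bbl"].dtype)   # Int64
```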

alexrichey commented 7 months ago

Would you define data checks in a yaml file, translate them to Pandera/pydantic classes, and validate?

Yes, exactly. So for example, for COLP I was thinking we'd just parse metadata, and implement custom checks for things like BBLs, or wkbs, etc. I think it'd just take a little glue code. But... seems like Pandera would most easily facilitate us writing our declarative checks in the format of our choice. At a glance, it seems like it's the most lightweight and hackable.

And I feel you with pandas dataframes converting. I've certainly felt that pain, but it mostly goes away when you read in everything as a string. I suppose I'd have concerns in the opposite direction, with potential type coercion happening when importing into a database.

Maybe it makes sense for me to quickly POC what I've described?

sf-dcp commented 7 months ago

If it's a quick POC to do, then yeah, it would be helpful to see!

sf-dcp commented 7 months ago

I'm not sure Pandera is the right tool for dq checks during product builds. It would work for output files, but not for intermediate tables...

sf-dcp commented 7 months ago

Update: geospatial queries work in Soda with Postgres! @damonmcc figured it out :)

(screenshot of the Soda geospatial check omitted)

I revised the PR with code as seen above.

Next step for me is to explore DuckDB with Soda for local files.

sf-dcp commented 6 months ago

During our discussions, a few more critical factors were brought up that should be a part of the decision matrix in addition to the original criteria. Below please find the aggregated criteria with their weights. I filled out the table to conduct a quantitative comparison of Pandera & Soda.

Note, a tool is measured on a scale of 1-5 and then weighted per criterion. For example, the "Easy-to-read checks" total for Pandera is 5 (rating) x 2 (criteria weight) = 10.

Open to discussion on the ratings. Hopefully, this will provide us with a framework to make the final, well-rounded decision.
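The scoring formula, sketched in code (the ratings and criteria below are an illustrative subset, not the actual table values):

```python
# Illustrative ratings (1-5 scale) and criteria weights -- not the full table
weights = {"Documentation quality": 1, "Easy-to-read checks": 2}
ratings = {
    "Soda":    {"Documentation quality": 4, "Easy-to-read checks": 4},
    "Pandera": {"Documentation quality": 5, "Easy-to-read checks": 5},
}

def weighted_total(tool_ratings: dict, weights: dict) -> int:
    # e.g. a rating of 5 with weight 2 contributes 5 * 2 = 10
    return sum(tool_ratings[c] * weights[c] for c in weights)

max_total = 5 * sum(weights.values())  # best possible score
for tool in ratings:
    t = weighted_total(ratings[tool], weights)
    print(f"{tool}: {t}/{max_total} = {t / max_total:.0%}")
```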

| Criteria | Criteria Weight | Soda | Pandera | Notes |
|---|---|---|---|---|
| Documentation Quality | 1 | 4 | 5 | |
| Easy-to-read data checks (cross team collaboration) | 2 | 8 | 10 | For Pandera, the assumption behind the rating is defining our own yaml format to make it readable. |
| Enables one file for all metadata (sources, products) | 2 | 10 | 10 | For Soda, the assumption behind the rating is defining data checks within our custom yaml file and executing Soda via Python. To use the CLI for Soda, a Soda checks file needs to be a standalone file. |
| Integrates with our data file formats (ingest, packaging) | 2 | 10 | 10 | |
| Integrates with Postgres (product builds) | 2 | 10 | 2 | For Pandera, there is no option to connect to a database directly. |
| Simple to add new data checks | 2 | 10 | 10 | |
| Simple to write custom data checks | 1 | 5 | 3 | |
| Total (raw score) | 70 | 57 | 50 | |
| Total (%) | 100% | 81% | 71% | |
alexrichey commented 6 months ago

@sf-dcp Would you be able to list some of your use-cases, in particular the common data checks you'd be performing, and maybe some of the more complicated cases? (especially those where Pandera might be too lightweight)

That might help us clarify useful categories and their respective weights.

Edit: Chatted with @sf-dcp and we scoped this down to just the lifecycle steps before and after the build, i.e. Ingest and Package/Distribute. Although there is the lingering question of whether we'd want to potentially use SODA to QA builds, which would change the weights pretty significantly.

fvankrieken commented 5 months ago
| Criteria | Criteria Weight | Soda | Pandera | Notes |
|---|---|---|---|---|
| Documentation Quality | 0 | 0 | 0 | They're both fine |
| Easy-to-write data checks | 2 | 3 | 4 | Breaking this into two. Pandera's ranking here clearly makes some assumptions about what we've set up (edit: knocked it down slightly for this reason). Soda takes knocks because it is effectively its own language: in the examples here, the keywords around invalid/valid, or all of the grammar around schema, are not intuitive. This is where I'd much rather have check names and kwargs and leave it at that. |
| Easy-to-read data checks (cross team collaboration) | 2 | 5 | 4 | There are a couple of things that are more awkward in Pandera (`row_count > n` or the like is very intuitive to read, if less so to write). |
| Enables one file for all metadata (sources, products) | 2 | 4 | 5 | Docking Soda a bit for not quite working out of the box for this, and I like the flexibility we have with Pandera. |
| Integrates with our data file formats (ingest, packaging) | 2 | 3 | 5 | It seems like Pandera slots very cleanly into packaging based on the tests Alex was writing. Doesn't mean that Soda couldn't, but working around pandas with Soda just seems a tad more cumbersome. Could maybe lower the weight of this row, though. |
| Integrates with Postgres (product builds) | 1 | 5 | 1 | For me the weight is lower because I feel that if we're moving towards dbt in our builds, it fits that portion nicely. Thinking now, this is something we should talk about a bit more, because dbt has remained largely out of our data validation talks. The counterpoint is that there's the barrier of actually setting up dbt for each of our products, which on its own is more overhead than Soda. So the weight here is an open question. |
| Simple to write custom data checks | 3 | 2 | 5 | This depends on what we're talking about. A one-off custom column check? Writing a query in Soda is very easy. Validating that a bbl is in the proper format, and re-using that custom logic across tests? I would much rather do that in python. It doesn't seem like these valid formats are extendable, sadly. Unless I'm missing something, I don't see custom reusable checks in Soda. |
| Simple to add new data checks | 0 | 0 | 0 | Both are fine |
| Cost of maintaining code around structure of checks, custom checks | 1 | 5 | 2 | This is something that I want to believe won't be a big issue if we used Pandera, but I have to admit there is some cost associated here. An argument could be made to weight this higher. |
| Total (raw score) | 65 | 46 | 51 | |
| Total (%) | | 70.7% | 78.5% | |

For the couple I weighted zero, it's more that they won't differentiate the tools for me, not that they're unimportant!

Going through this made a couple of things clear to me

alexrichey commented 5 months ago

I think this is a really great rubric. For me there are a few open questions that would affect the score on SODA. Let's discuss!

| Criteria | Criteria Weight | Soda | Pandera | Notes |
|---|---|---|---|---|
| Documentation Quality | 1 | 3 | 7 | The draw for me here is that Pandera has about two paragraphs of relevant documentation. It's intuitive enough that I don't need documentation. |
| Easy-to-read data checks (cross team collaboration) | 2 | 8 | 10 | For Pandera, the assumption behind the rating is defining our own yaml format to make it readable. (unchanged from Sasha's) |
| Enables one file for all metadata (sources, products) | 2 | Q1? | 10 | This is the open question for me. It wasn't clear how we'd do this with SODA, though I suppose we could always generate SODA yaml files off our product metadata. |
| Integrates with our data file formats (ingest, packaging) | 1 | 10 | 10 | |
| Integrates with Postgres (product builds) | Q2? | 10 | 0 | For Pandera, there is no option to connect to a database directly. With a little bit of glue code, we could connect Pandera to any data source, but that's not a great option in certain cases (e.g. big data sets). |
| Simple to add new data checks | 2 | 10 | 10 | |
| Simple to write custom data checks | 1 | 5 | 10 | |
| Total (raw score) | 70 | 36 + 2x(Q1?) + (Q2?)x10 = range of 36-76 | 57 | |
alexrichey commented 5 months ago

these really are different tools. Soda is a lightweight, out-of-the-box tool to declare and run data checks. Pandera is a framework to set up your own system of running data checks

@fvankrieken My thinking is that Pandera is more akin to a library - you can adapt it however you'd like, but there's a definite cost to the open-endedness. Whereas SODA is very much a framework (somewhat akin to DBT, or, in another sphere entirely, Ruby On rails), with expected ways to do things right out of the box, and that comes with its own pros and cons.

damonmcc commented 5 months ago
| Criteria | Criteria Weight | Soda | Pandera | Notes |
|---|---|---|---|---|
| Documentation Quality | 1 | 4 | 5 | |
| Easy-to-read data checks (cross team collaboration) | 2 | 9 | 4 | |
| Enables one file for all metadata (sources, products) | 2 | 8 | 8 | |
| Integrates with our data file formats (ingest, packaging) | 2 | 8 | 6 | |
| Integrates with Postgres (product builds) | 2 | 10 | 2 | |
| Simple to add new data checks | 2 | 8 | 6 | |
| Simple to write custom data checks | 1 | 6 | 8 | |
| Total (raw score) | 70 | 53 | 39 | |
| Total (%) | 100% | XX% | XX% | |
sf-dcp commented 5 months ago

Per in-person discussion, Pandera wins for the ingest/packaging cases. Aside from the "Integrates with Postgres (product builds)" criterion, Pandera is the winner primarily for the readability of its checks and its integration with existing yaml files.

Regarding the ETL data checks, we need more info to decide between Soda and dbt. If we don't seek validation from external teams during the ETL process, we are leaning towards dbt.

todo (not a part of this issue):