NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Choose Data Contract Framework #769

Open · sf-dcp opened this issue 2 weeks ago

sf-dcp commented 2 weeks ago

Related to #650. Creating a separate issue because 1) usage of data contracts can expand beyond data library (i.e. to non-dbt data products), and 2) it gives us space for discussion.

Motivation

As a part of the data library revamp (AKA ingest, AKA extract), we would like to introduce data contracts into the data ingestion process to keep our data clean and consistent from the start. With data contracts, we can set clear rules on what data should look like and catch issues such as column name changes before data product builds, making the actual builds faster. They will also provide transparency and enable cross-team collaboration, such as embedding business logic to enhance data quality. All in all, it should make everyone's life easier 🤞

Approach

There exist multiple frameworks for data validation. From my research, the main open source tools are Great Expectations and Soda. A smaller one is Pandera.

Thinking about our needs, I came up with the following guidelines to evaluate the tools:

Review of each framework

Pandera

Great Expectations (GX)

Soda

Summary

So far, Soda seems to best fit our needs in terms of simplicity, integration with Postgres & local files, custom data checks, and readability of data contracts. I'm leaning towards the Postgres integration for simplicity. It can be easily integrated with current builds: we would need to 1) define YAML file(s) with data checks and Postgres connection info and 2) run a CLI command. Don't love that geospatial checks may not be available.
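To make 1) and 2) concrete, here's a rough sketch of what that could look like using soda-core's Python Scan API instead of the CLI. The data source name, connection values, table, and columns below are all made up:

```python
# Sketch only: a minimal Soda Core scan against a Postgres build database.
# Data source name, connection info, table (dcp_pluto), and columns are hypothetical.
from soda.scan import Scan

configuration_yaml = """
data_source build_postgres:
  type: postgres
  host: localhost
  port: "5432"
  username: postgres
  password: postgres
  database: builds
  schema: public
"""

checks_yaml = """
checks for dcp_pluto:
  - row_count > 0
  - missing_count(bbl) = 0
  - duplicate_count(bbl) = 0
  - schema:
      fail:
        when required column missing: [bbl, borough]
"""

scan = Scan()
scan.set_data_source_name("build_postgres")
scan.add_configuration_yaml_str(configuration_yaml)
scan.add_sodacl_yaml_str(checks_yaml)

exit_code = scan.execute()       # runs the checks defined above
print(scan.get_logs_text())      # per-check pass/fail details
scan.assert_no_checks_fail()     # raise if any check failed
```

The CLI equivalent would be along the lines of `soda scan -d build_postgres -c configuration.yml checks.yml`, with the same YAML split into two files.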

Next steps

Implement the same data checks via Soda & Great Expectations and compare their implementations:

Edit: the write-up was revised/enhanced after the PRs above.

sf-dcp commented 2 weeks ago

I will tag the team here once I implement data checks via Soda & Great Expectations to demonstrate the differences between the two tools. Feel free to add questions/immediate thoughts in the meantime.

damonmcc commented 4 days ago

so thorough! 👏🏾 Soda seems like a winner to me

having to read local files into DuckDB seems like a fun thing for us to be forced to do lol

since DuckDB has a geospatial extension, maybe that could unlock geospatial checks someday, and maybe even let us check FileGDBs. looks like this open issue in the soda-core repo is related: https://github.com/sodadata/soda-core/issues/1964
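rough sketch of what reading a FileGDB through the spatial extension could look like (path, layer, and geometry column name are made up, and this assumes `st_read` can open the FileGDB via GDAL):

```python
# Sketch: read one layer of a FileGDB into a DuckDB table via the spatial extension.
# The .gdb path and layer name are hypothetical; st_read goes through GDAL under the hood.
import duckdb

con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")

con.sql("""
    CREATE TABLE colp AS
    SELECT * FROM st_read('colp_package.gdb', layer = 'colp')
""")

# ...which Soda (or plain SQL) could then check, e.g. for invalid geometries
# (geometry column is typically named geom, but that can vary by source).
print(con.sql("SELECT count(*) FROM colp WHERE NOT ST_IsValid(geom)").fetchall())
```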

when we really need them, I think we can already use dbt for geospatial checks via custom tests

alexrichey commented 3 days ago

I'm actually quite intrigued by Pandera. Since we use Pydantic already, we could potentially re-use some model code. But it also looks flexible enough to hook up to our existing metadata. We'd possibly just have to write a little glue code to parse our product metadata and translate it into a DataFrameSchema, which is just a better version of what I'm already doing for package validation. And if we need to do geo-specific checks, we could write those in dcpy.

I suppose for me the big question is whether we want to do validation in (geo)dataframes or in a database. For my use case, dataframes are preferable. Looks like Pandera integrates nicely with geopandas as well.
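To make the glue-code idea concrete, a rough sketch of the kind of schema we might generate from product metadata (column names, the BBL regex, and the input file are all made up):

```python
# Sketch: a hand-written stand-in for a schema we'd generate from product metadata.
# Column names, the BBL regex, and the file path are hypothetical.
import geopandas as gpd
import pandera as pa

colp_schema = pa.DataFrameSchema(
    {
        "bbl": pa.Column(str, checks=pa.Check.str_matches(r"^\d{10}$"), unique=True),
        "agency": pa.Column(str, nullable=True),
        # geopandas integration: validate the geometry column directly
        "geometry": pa.Column(
            "geometry",
            checks=pa.Check(lambda s: s.is_valid, name="valid_geometry"),
        ),
    }
)

gdf = gpd.read_file("colp.shp")  # or however we materialize the packaged files
colp_schema.validate(gdf)        # raises a SchemaError on a failing check
```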

Thoughts?

damonmcc commented 3 days ago

@alexrichey on validating (geo)dataframes vs database tables:

Since we build and export from DB tables, it seems like validating tables is better than converting to and validating (geo)dataframes.

And the ability to validate files (source and packaged) seems like a significant feature we want. I guess we can always load an FGDB we've generated into a set of geodataframes, and maybe the only other alternative for validating it (via Soda) is to use DuckDB like this.
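e.g. something along these lines (the .gdb path is made up, and this assumes pyogrio, geopandas' default IO engine these days, is around to list layers):

```python
# Sketch: load every layer of a generated FileGDB into geodataframes for validation.
# The .gdb path is hypothetical; geopandas reads it through pyogrio/GDAL.
import geopandas as gpd
from pyogrio import list_layers

gdb_path = "colp_package.gdb"
layer_names = [name for name, _geometry_type in list_layers(gdb_path)]
gdfs = {name: gpd.read_file(gdb_path, layer=name) for name in layer_names}
```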

alexrichey commented 3 days ago

Since we build and export from DB tables, it seems like validating tables is better than converting to and validating (geo)dataframes.

Well, I need to validate post-export from a database. And for ingest code, it's validating pre-import to a database, right? Then for actual database stuff, we've got dbt.

For my use-case, the database is just another dependency to account for. It would make a lot of sense if our data was too large to store (and operate on) in memory, though. Which maybe is the case with something like PLUTO?

alexrichey commented 3 days ago

Would like to hear a little more about requirements on the ingest side @sf-dcp and @fvankrieken

damonmcc commented 3 days ago

Well, I need to validate post-export from a database.

Totally, and those are files. So I imagine this is our ranking of preferred data formats to validate:

  1. database tables (builds)
  2. files (source data, build exports)
  3. geodataframes (conversions of 1 and 2)

sf-dcp commented 3 days ago

@alexrichey, for your case with distributed datasets, how do you envision integration with Pandera? Would you define data checks in a yaml file, translate them to Pandera pydantic classes, and validate?

I think our ideal framework is one that's able to work with both a database and files with minimal setup and future maintenance. Also a huge bonus if it's readable enough that we can collaborate on data checks with GIS or other teams.

sf-dcp commented 3 days ago

Side note... my personal preference is working with anything but pandas dataframes because of the funkiness with data types.

When you load local data straight into geopandas, it's fine: it acts like Postgres or GDAL, not changing data types. On the other hand, when it comes to regular dfs, pandas changes data types. For example, if you have an integer column with nulls (say BBLs), pandas converts the values to decimals and replaces nulls with NaN values. And this behavior persists when you convert a pandas df to a geopandas df (the case with CSV files that contain geospatial data).
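Quick illustration with a made-up two-column CSV, plus the usual workarounds:

```python
# The null in the bbl column silently upcasts the whole column to floats.
import io
import pandas as pd

csv = "bbl,borough\n1000010001,MN\n,BK\n1000010003,BK\n"

df = pd.read_csv(io.StringIO(csv))
print(df["bbl"].dtype)     # float64
print(df["bbl"].tolist())  # [1000010001.0, nan, 1000010003.0]

# Workarounds: read everything as strings, or use pandas' nullable integer dtype.
df_str = pd.read_csv(io.StringIO(csv), dtype=str)               # bbl stays a string column
df_int = pd.read_csv(io.StringIO(csv), dtype={"bbl": "Int64"})  # bbl is Int64, nulls become <NA>
print(df_int["bbl"].dtype)  # Int64
```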

alexrichey commented 3 days ago

Would you define data checks in a yaml file, translate them to Pandera pydantic classes, and validate?

Yes, exactly. So for example, for COLP I was thinking we'd just parse metadata and implement custom checks for things like BBLs, WKBs, etc. I think it'd just take a little glue code. But... it seems like Pandera would most easily facilitate us writing our declarative checks in the format of our choice. At a glance, it seems like it's the most lightweight and hackable.

And I feel you with pandas dataframes converting. I've certainly felt that pain, but it mostly goes away when you read in everything as a string. I suppose I'd have concerns in the opposite direction, with potential type coercion happening when importing into a database.

Maybe it makes sense for me to quickly POC what I've described?

sf-dcp commented 3 days ago

If it's a quick POC to do, then yeah, it would be helpful to see!

sf-dcp commented 3 days ago

I'm not sure Pandera is the right tool for dq checks during product builds. It would work for output files, but not for intermediate tables...

sf-dcp commented 3 days ago

Update: geospatial queries work in Soda with Postgres! @damonmcc figured it out :)

[screenshot of the geospatial check running in Soda against Postgres]

I revised the PR with code as seen above.
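For anyone who doesn't want to open the PR: a geospatial check in SodaCL can be written as a failed-rows check with a custom PostGIS query, roughly like this (table and column names are made up; the PR has the real implementation):

```python
# Illustration only: a SodaCL failed-rows check leaning on PostGIS functions
# available in the Postgres data source. Table and column names are hypothetical.
geospatial_checks = """
checks for dcp_pluto:
  - failed rows:
      name: geometries are valid
      fail query: |
        SELECT bbl, ST_IsValidReason(geom) AS reason
        FROM dcp_pluto
        WHERE NOT ST_IsValid(geom)
"""
# e.g. scan.add_sodacl_yaml_str(geospatial_checks) before scan.execute(),
# following the Scan sketch earlier in this issue.
```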

Next step for me is to explore DuckDB with Soda for local files.