Closed: sf-dcp closed this 5 months ago
I will tag the team here once I implement data checks via Soda & Great Expectations to demonstrate differences between the 2 tools. Feel free to add questions/immediate thoughts in the meantime
so thorough! 👏🏾 Soda seems like a winner to me
having to read local files into DuckDB seems like a fun thing for us to be forced to lol
since DuckDB has a geospatial extension, maybe that could unlock geospatial checks someday, and maybe even let us check FileGDBs. Looks like this open issue in the soda-core repo is related: https://github.com/sodadata/soda-core/issues/1964
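For a sense of what the DuckDB spatial extension already gives us locally, here's a minimal sketch using the `duckdb` Python package. The file path, table, and geometry column name are placeholders; ST_Read goes through GDAL, so FileGDB layers should be readable in principle:

```python
import duckdb

con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")

# ST_Read uses GDAL under the hood, so shapefiles, GeoJSON, and (in principle) FileGDB
# layers are readable. "data/colp.gdb" is a hypothetical path; ST_Read names the
# geometry column "geom".
con.execute("CREATE TABLE colp AS SELECT * FROM ST_Read('data/colp.gdb')")

# a simple geospatial check: count invalid geometries
invalid = con.execute("SELECT count(*) FROM colp WHERE NOT ST_IsValid(geom)").fetchone()[0]
print(f"invalid geometries: {invalid}")
```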
when we really need them, I think we can already use dbt for geospatial checks via custom tests
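For example, a dbt singular test is just a SQL file under `tests/` that selects the rows violating an expectation; a rough sketch (the model and column names here are made up):

```sql
-- tests/assert_geometries_are_valid.sql  (hypothetical singular test)
-- dbt fails the test if this query returns any rows
select uid, st_isvalidreason(geom) as reason
from {{ ref('colp') }}
where geom is not null
  and not st_isvalid(geom)
```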
I'm actually quite intrigued by Pandera. Since we use Pydantic already, we could potentially re-use some model code. But it also looks flexible enough to hook up to our existing metadata. We'd possibly just have to write a little glue code to parse our product metadata and translate it into a DataFrameSchema, which is just a better version of what I'm already doing for package validation. And if we need to do geo-specific checks, we could write those in dcpy.
I suppose for me the big question is whether we want to do validation in (geo)dataframes or in a database. For my use case, dataframes are preferable. Looks like Pandera integrates nicely with geopandas as well.
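For instance, a minimal sketch of a Pandera schema over a geodataframe (the column names and the BBL regex are just illustrative, not our real metadata):

```python
import geopandas as gpd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        # hypothetical columns; the BBL pattern is only an example
        "bbl": pa.Column(str, pa.Check.str_matches(r"^[1-5]\d{9}$")),
        "geometry": pa.Column(
            "geometry",
            checks=pa.Check(lambda s: s.is_valid, name="geometry_is_valid"),
        ),
    },
)

gdf = gpd.read_file("colp.shp")  # hypothetical file
schema.validate(gdf)             # raises on failures (or collect them all with lazy=True)
```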
Thoughts?
@alexrichey on validating (geo)dataframes vs database tables:
Since we build and export from DB tables, it seems like validating tables is better than converting to and validating (geo)dataframes.
And the ability to validate files (source and packaged) seems like a significant feature we want. I guess we can always load a FileGDB we've generated into a set of geodataframes, and maybe the only other alternative to validate it (via Soda) is to use DuckDB like this.
> Since we build and export from DB tables, it seems like validating tables is better than converting to and validating (geo)dataframes.
Well, I need to validate post export from a database. And for Ingest code, it's validating pre-import to a database, right? Then for actual database stuff, we've got dbt.
For my use-case, the database is just another dependency to account for. It would make a lot of sense if our data was too large to store (and operate on) in memory, though. Which maybe is the case with something like PLUTO?
Would like to hear a little more about requirements on the ingest side @sf-dcp and @fvankrieken
> Well, I need to validate post export from a database.
Totally, and those are files. So I imagine this is our ranking of preferred data formats to validate
@alexrichey, for your case with distributed datasets, how do you envision integration with Pandera? Would you define data checks in a yaml file, translate them to Pandera pydantic classes and validate?
I think our ideal framework is one that's able to work with both databases and files with minimal setup and future maintenance. Also a huge bonus if it's readable enough that we can collaborate on data checks with GIS or other teams.
With the case of current builds and their validation in a database: it would be nice to have something that can be implemented quickly, and I feel like dbt is not lightweight enough to do that. We would probably just want to refactor current pipelines into dbt rather than creating checks in the end.
With the case of validation at extract time, we can validate an existing dataframe, an existing local file (which is in `.parquet` format), or load it into Postgres and do it there. So we are flexible with formats. My main reservation with Pandera is seeing it as an internal DE framework only: i.e. it's not readable enough for other teams (if we are sharing python files). Unless we create a yaml wrapper.
Side note... my personal preference is working with anything but `pandas` dataframes because of the funkiness with data types. When you load local data straight into `geopandas`, it's fine: it acts like Postgres or gdal, not changing data types. On the other hand, when it comes to regular dfs, it changes data types. For example, if you've got an integer column with nulls (say bbls), `pandas` converts the values to decimals and replaces nulls with `NaN` values. And this behavior persists when you convert a `pandas` df to a `geopandas` df (the case with csv files that have geospatial data).
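A tiny illustration of that coercion, plus the usual workarounds (the csv path and column are hypothetical):

```python
import pandas as pd

# an integer column with nulls gets upcast to float: 1000010001 -> 1000010001.0, nulls -> NaN
df = pd.read_csv("lots.csv")                  # hypothetical file with a "bbl" column
print(df["bbl"].dtype)                        # float64

# workarounds: read everything as text, or use pandas' nullable integer dtype
df_str = pd.read_csv("lots.csv", dtype=str)                  # values stay strings
df_int = pd.read_csv("lots.csv", dtype={"bbl": "Int64"})     # nullable Int64, nulls become <NA>
```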
> Would you define data checks in a yaml file, translate them to Pandera pydantic classes and validate?
Yes, exactly. So for example, for COLP I was thinking we'd just parse metadata, and implement custom checks for things like BBLs, or wkbs, etc. I think it'd just take a little glue code. But... seems like Pandera would most easily facilitate us writing our declarative checks in the format of our choice. At a glance, it seems like it's the most lightweight and hackable.
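Roughly something like this; the metadata format, column names, and check names below are made up for illustration, not our actual product metadata:

```python
import pandera as pa
import yaml

# hypothetical metadata snippet - the real product metadata format would drive this
RAW_METADATA = """
columns:
  - name: bbl
    dtype: str
    checks: [bbl]
  - name: land_use
    dtype: str
    nullable: true
"""

# reusable custom checks, keyed by the names used in metadata (patterns are illustrative)
CUSTOM_CHECKS = {
    "bbl": pa.Check.str_matches(r"^[1-5]\d{9}$"),
}

def schema_from_metadata(raw: str) -> pa.DataFrameSchema:
    """Translate (hypothetical) column metadata into a Pandera DataFrameSchema."""
    meta = yaml.safe_load(raw)
    columns = {
        col["name"]: pa.Column(
            col["dtype"],
            checks=[CUSTOM_CHECKS[c] for c in col.get("checks", [])],
            nullable=col.get("nullable", False),
        )
        for col in meta["columns"]
    }
    return pa.DataFrameSchema(columns)

schema = schema_from_metadata(RAW_METADATA)
```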
And I feel you with pandas dataframes converting. I've certainly felt that pain, but it mostly goes away when you read in everything as a string. I suppose I'd have concerns in the opposite direction, with potential type coercion happening when importing into a database.
Maybe it makes sense for me to quickly POC what I've described?
If it's a quick POC to do, then yeah, it would be helpful to see!
I'm not sure Pandera is the right tool for dq checks during product builds. It would work for output files, but not for intermediate tables...
Update, geospatial queries work in Soda with Postgres! @damonmcc figured it out :)
I revised the PR with code as seen above.
Next step for me is to explore DuckDB with Soda for local files.
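For anyone skimming, here's a rough sketch (not the actual PR code) of how a PostGIS-backed check can run through soda-core's Python API; the data source name, table, and columns are placeholders:

```python
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("pg_builds")                 # data source defined in configuration.yml
scan.add_configuration_yaml_file("configuration.yml")  # Postgres connection info lives there

# a SodaCL "failed rows" check backed by a PostGIS query; table/column names are placeholders
scan.add_sodacl_yaml_str("""
checks for colp:
  - row_count > 0
  - failed rows:
      name: all geometries are valid
      fail query: |
        SELECT uid, ST_IsValidReason(geom)
        FROM colp
        WHERE NOT ST_IsValid(geom)
""")

scan.execute()
print(scan.get_logs_text())
scan.assert_no_checks_fail()
```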
During our discussions, a few more critical factors were brought up that should be part of the decision matrix in addition to the original criteria. Below please find the aggregated criteria with their weights. I filled out the table to conduct a quantitative comparison of Pandera & Soda.
Note, a tool is rated on a scale of 1-5 and then weighted by the criteria weight. For example, the "Easy-to-read checks" total for Pandera is 5 (rating) x 2 (criteria weight) = 10.
Open to discussion on the ratings. Hopefully, this will provide us with a framework to make a final, well-rounded decision.
Criteria | Criteria Weight | Soda | Pandera | Notes |
---|---|---|---|---|
Documentation Quality | 1 | 4 | 5 | |
Easy-to-read data checks (cross team collaboration) | 2 | 8 | 10 | For Pandera, assumption for the rating is defining our own yaml format to make it readable. |
Enables one file for all metadata (sources, products) | 2 | 10 | 10 | For Soda, assumption for the rating is defining data checks within our custom yaml file and executing Soda via Python. To use CLI for Soda, a soda checks file needs to be a stand alone file. |
Integrates with our data file formats (ingest, packaging) | 2 | 10 | 10 | |
Integrates with Postgres (product builds) | 2 | 10 | 2 | For Pandera, there is no option to connect to database directly. |
Simple to add new data checks | 2 | 10 | 10 | |
Simple to write custom data checks | 1 | 5 | 3 | |
Total (raw score) | 70 | 57 | 50 | |
Total (%) | 100% | 81% | 71% |
@sf-dcp Would you be able to list some of your use-cases, in particular the common data checks you'd be performing, and maybe some of the more complicated cases? (especially those where Pandera might be too lightweight)
That might help us clarify useful categories and their respective weights.
Edit: Chatted with @sf-dcp and we scoped this down to just the lifecycle steps before and after the build, i.e. Ingest and Package/Distribute. Although there is the lingering question of whether we'd want to potentially use SODA to QA builds, which would change the weights pretty significantly.
Criteria | Criteria Weight | Soda | Pandera | Notes |
---|---|---|---|---|
Documentation Quality | 0 | 0 | 0 | They're both fine |
Easy-to-write data checks | 2 | 3 | 4 | Breaking this into two. Pandera's ranking here clearly makes some assumptions of what we've set up (edit - knocked it down slightly for this reason). Soda takes knocks because of it effectively being its own language - in the examples here, keywords around invalid/valid, or all of the grammar around schema is not intuitive. This is where I'd much rather have check names and kwargs and leave it at that |
Easy-to-read data checks (cross team collaboration) | 2 | 5 | 4 | There are a couple things that are more awkward in Pandera (row_count > n or stuff like that is very intuitive to read if less so to write) |
Enables one file for all metadata (sources, products) | 2 | 4 | 5 | docking Soda a bit for not quite working out of the box for this, and I like the flexibility we have with Pandera |
Integrates with our data file formats (ingest, packaging) | 2 | 3 | 5 | It seems like pandera slots very cleanly into packaging based on the tests Alex was writing. Doesn't mean that Soda couldn't, but working around pandas with soda just seems a tad more cumbersome. Could lower the weight of this row though maybe |
Integrates with Postgres (product builds) | 1 | 5 | 1 | For me the weight is lower because I feel that if we're moving towards dbt in our builds, that it fits that portion nicely. Thinking now, I feel like this is something we should talk about a bit more, because dbt has remained largely out of our data validation talks. Counterpoint is that there's the barrier of actually setting up dbt for each of our products, which on its own is more overhead than Soda. So weight here is an open question. |
Simple to write custom data checks | 3 | 2 | 5 | This depends what we're talking about. One-off custom column check? Writing a query in Soda is very easy. Validating a bbl is in proper format, and re-using that custom logic across tests? I would much rather do that in python. It doesn't seem like these valid formats are extendable sadly. Unless I'm missing something, I don't see custom reusable checks in Soda? |
Simple to add new data checks | 0 | 0 | 0 | both are fine |
Cost of maintaining code around structure of checks, custom checks | 1 | 5 | 2 | This is something that I want to believe won't be a big issue if we used pandera, but have to admit that there is some cost associated here. Argument could be made to weight this higher |
Total (raw score) | 65 | 46 | 51 | |
Total (%) | 100% | 70.7% | 78.5% |
For the couple I weighted zero: it's more that they're not going to differentiate the tools for me, not that they're not important!
Going through this made a couple things clear to me
I think this is a really great rubric. For me there are a few open questions that would affect the score on SODA. Let's discuss!
Criteria | Criteria Weight | Soda | Pandera | Notes |
---|---|---|---|---|
Documentation Quality | 1 | 3 | 7 | The draw for me here is that Pandera has about two paragraphs of relevant documentation. It's intuitive enough that I don't need documentation. |
Easy-to-read data checks (cross team collaboration) | 2 | 8 | 10 | For Pandera, assumption for the rating is defining our own yaml format to make it readable. (unchanged from Sasha's) |
Enables one file for all metadata (sources, products) | 2 | Q1? | 10 | This is the open question for me. It wasn't clear how we'd do this with SODA - though I suppose we could always generate SODA yaml files off our product metadata |
Integrates with our data file formats (ingest, packaging) | 1 | 10 | 10 | |
Integrates with Postgres (product builds) | Q2? | 10 | 0 | For Pandera, there is no option to connect to database directly. With a little bit of glue code, we could connect Pandera to any data source - but that's not a great option in certain cases (e.g. big data sets) |
Simple to add new data checks | 2 | 10 | 10 | |
Simple to write custom data checks | 1 | 5 | 10 | |
Total (raw score) | 70 | 36 + 2x(Q1?) + Q2?(10) == range of 36-76 | 57 |
> these really are different tools. Soda is a lightweight, out-of-the-box tool to declare and run data checks. Pandera is a framework to set up your own system of running data checks.
@fvankrieken My thinking is that Pandera is more akin to a library - you can adapt it however you'd like, but there's a definite cost to the open-endedness. Whereas SODA is very much a framework (somewhat akin to dbt, or, in another sphere entirely, Ruby on Rails), with expected ways to do things right out of the box, and that comes with its own pros and cons.
Criteria | Criteria Weight | Soda | Pandera | Notes |
---|---|---|---|---|
Documentation Quality | 1 | 4 | 5 | |
Easy-to-read data checks (cross team collaboration) | 2 | 9 | 4 | |
Enables one file for all metadata (sources, products) | 2 | 8 | 8 | |
Integrates with our data file formats (ingest, packaging) | 2 | 8 | 6 | |
Integrates with Postgres (product builds) | 2 | 10 | 2 | |
Simple to add new data checks | 2 | 8 | 6 | |
Simple to write custom data checks | 1 | 6 | 8 | |
Total (raw score) | 70 | 53 | 39 | |
Total (%) | 100% | XX% | XX% |
Per in-person discussion, Pandera wins for the ingest/packaging cases. Aside from the "Integrates with Postgres (product builds)" criterion, Pandera is the winner primarily for check readability and integration with existing yaml files.
Regarding the ETL data checks, we need more info to decide between Soda vs dbt. If we don't seek validation from external teams during the ETL process, we are leaning towards dbt.
todo (not a part of this issue):
- `dcpy` (first will put in one place)
Related to #650. Creating a separate issue because 1) usage of data contracts can expand beyond data library (i.e. for non-`dbt` data products) and 2) to provide space for discussion.

Motivation
As a part of the data library revamp (AKA ingest, AKA extract), we would like to introduce data contracts into the data ingestion process to keep our data clean and consistent from the start. With data contracts, we can set clear rules on what data should look like and catch issues such as column name changes before data product builds, making actual builds faster. It will also provide transparency and enable cross-team collaboration, such as embedding business logic to enhance data quality. All in all, it should make everyone's life easier 🤞
Approach
There exist multiple frameworks for data validation. From my research, the main open source tools are Great Expectations and Soda. A smaller one is Pandera.
Thinking about our needs, I came up with the following guidelines to evaluate the tools:
Review of each framework
Pandera
- `mypy` for typechecking
- `pydantic` classes to define table schemas and data checks

Great Expectations (GX)
- integrates with `dbt`, can generate self-hosted docs in html

Soda
- for local files: a `dask` dataframe or `DuckDB` (geospatial queries not available)
- `soda core` is their free python library and CLI tool
- local files are read into a `dask` dataframe, a distributed version of `pandas`. The dask dataframe uses the presto sql dialect, which can be different from postgres and doesn't have geospatial support. It could be better to read local files into DuckDB - TBD how well it works.
- integrates with Postgres (schema `public`)

Summary
So far, Soda seems to best fit our needs in terms of simplicity, integration with Postgres & local files, custom data checks, and readability of data contracts. I'm leaning towards integration with Postgres for simplicity. It can be easily integrated with current builds: we would need to 1) define yaml file(s) with data checks and Postgres connection info and 2) run a CLI command. Don't love that geospatial checks may not be available.
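As a rough sketch of that workflow (the connection values, table, and column names below are placeholders, not our real config):

```yaml
# configuration.yml - connection info
data_source pg_builds:
  type: postgres
  host: localhost
  port: 5432
  username: builder
  password: "***"
  database: db
  schema: public
```

```yaml
# checks.yml - data checks in SodaCL
checks for colp:
  - row_count > 0
  - missing_count(bbl) = 0
  - duplicate_count(uid) = 0
```

The scan would then be run with something like `soda scan -d pg_builds -c configuration.yml checks.yml`.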
Next steps
Implement the same data checks via Soda & Great Expectations and compare their implementations:
Edit: the write-up was revised/enhanced after the PRs above.