sf-dcp opened this issue 2 weeks ago
I will tag the team here once I implement data checks via Soda & Great Expectations to demonstrate differences between the 2 tools. Feel free to add questions/immediate thoughts in the meantime
so thorough! 👏🏾 Soda seems like a winner to me
having to read local files into DuckDB seems like a fun thing for us to be forced to do lol
since DuckDB has a geospatial extension, maybe that could unlock geospatial checks someday, and maybe even let us check FileGDBs. Looks like this open issue in the soda-core
repo is related: https://github.com/sodadata/soda-core/issues/1964
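For context, a rough sketch of what that might look like outside of Soda: DuckDB's spatial extension reads vector formats (including FileGDBs) through GDAL via `ST_Read`, so a geospatial check can already be expressed as a plain query. The gdb path, layer name, and geometry column name below are assumptions, not something we have today.

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL spatial")
con.sql("LOAD spatial")

# ST_Read goes through GDAL, so it can open a FileGDB layer directly.
# The path, layer name, and `geom` column are hypothetical.
invalid_count = con.sql(
    """
    SELECT count(*)
    FROM ST_Read('output/colp.gdb', layer='colp')
    WHERE NOT ST_IsValid(geom)
    """
).fetchone()[0]

print(f"{invalid_count} invalid geometries")
```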
when we really need them, I think we can already use dbt for geospatial checks via custom tests
I'm actually quite intrigued by Pandera. Since we use Pydantic already, we could potentially re-use some model code. But it also looks flexible enough to hook up to our existing metadata. We'd possibly just have to write a little glue code to parse our product metadata and translate it into a DataFrameSchema, which is just a better version of what I'm already doing for package validation. And if we need to do geo-specific checks, we could write those in dcpy.
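To make the glue-code idea concrete, here's a rough sketch. The metadata shape and check names below are made up, not our actual product metadata format; in reality the list would be parsed from our metadata yaml.

```python
import pandera as pa

# Hypothetical, simplified product metadata -- column names, dtypes, and a
# couple of declarative checks.
COLUMNS = [
    {"name": "bbl", "dtype": "str", "checks": {"pattern": r"^\d{10}$"}},
    {"name": "borough", "dtype": "str", "checks": {"isin": ["1", "2", "3", "4", "5"]}},
    {"name": "lot_area", "dtype": "int", "checks": {"min": 0}},
]

DTYPES = {"str": str, "int": int, "float": float}


def _to_check(name: str, arg) -> pa.Check:
    """Translate one declarative check into a pandera Check."""
    if name == "pattern":
        return pa.Check.str_matches(arg)
    if name == "isin":
        return pa.Check.isin(arg)
    if name == "min":
        return pa.Check.ge(arg)
    raise ValueError(f"unknown check type: {name}")


def schema_from_metadata(columns: list[dict]) -> pa.DataFrameSchema:
    """The 'glue code': declarative column metadata -> pandera DataFrameSchema."""
    return pa.DataFrameSchema(
        {
            col["name"]: pa.Column(
                DTYPES[col["dtype"]],
                checks=[_to_check(k, v) for k, v in col.get("checks", {}).items()],
                nullable=False,
            )
            for col in columns
        }
    )


schema = schema_from_metadata(COLUMNS)
# schema.validate(df) then raises a SchemaError describing any failing checks
```

The translation layer would be the only thing we'd own; the checks themselves stay declarative.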
I suppose for me the big question is whether we want to do validation in (geo)dataframes or in a database. For my use case, dataframes are preferable. Looks like Pandera integrates nicely with geopandas as well.
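And since a GeoDataFrame is still a pandas DataFrame, the same schema machinery seems to apply to geometry columns too. A minimal, hedged sketch (pandera also ships explicit geopandas typing support, which I haven't dug into):

```python
import geopandas as gpd
import pandera as pa
from shapely.geometry import Point

# Element-wise checks can call GeoSeries methods like .is_valid directly.
geo_schema = pa.DataFrameSchema(
    {
        "bbl": pa.Column(str, pa.Check.str_matches(r"^\d{10}$")),
        "geometry": pa.Column(checks=pa.Check(lambda s: s.is_valid, name="valid_geometry")),
    }
)

gdf = gpd.GeoDataFrame(
    {"bbl": ["1000010001"], "geometry": [Point(990000, 190000)]},
    crs="EPSG:2263",
)
geo_schema.validate(gdf)  # raises SchemaError if any geometry is invalid
```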
Thoughts?
@alexrichey on validating (geo)dataframes vs database tables:
Since we build and export from DB tables, it seems like validating tables is better than converting to and validating (geo)dataframes.
And the ability to validate files (source and packaged) seems like a significant feature we want. I guess we can always load a FileGDB we've generated into a set of geodataframes, and maybe the only other alternative to validate it (via Soda) is to use DuckDB like this.
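Something like this is what I had in mind for the geodataframe route (the path is hypothetical):

```python
import fiona
import geopandas as gpd

gdb_path = "output/colp.gdb"  # hypothetical path to a packaged FileGDB

# One GeoDataFrame per layer; each could then be validated separately.
gdfs = {layer: gpd.read_file(gdb_path, layer=layer) for layer in fiona.listlayers(gdb_path)}
```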
> Since we build and export from DB tables, it seems like validating tables is better than converting to and validating (geo)dataframes.
Well, I need to validate post export from a database. And for Ingest code, it's validating pre-import to a database, right? Then for actual database stuff, we've got dbt.
For my use-case, the database is just another dependency to account for. It would make a lot of sense if our data was too large to store (and operate on) in memory, though. Which maybe is the case with something like PLUTO?
Would like to hear a little more about requirements on the ingest side @sf-dcp and @fvankrieken
> Well, I need to validate post export from a database.
Totally, and those are files. So I imagine this is our ranking of preferred data formats to validate
@alexrichey, for your case with distributed datasets, how do you envision integration with Pandera? Would you define data checks in a yaml file, translate them to Pandera pydantic classes and validate?
I think our ideal framework is the one that's able to work with both databases and files with minimal setup and future maintenance. Also a huge bonus if it's readable enough that we can collaborate on data checks with GIS or other teams.
With the case of current builds and their validation in a database: it would be nice to have something that can be implemented quickly, and I feel like dbt is not lightweight enough for that. We would probably just end up refactoring current pipelines into dbt rather than only creating checks at the end.
With the case of validation at extract time, we can validate an existing dataframe, an existing local file (which is in .parquet format), or load it into Postgres and do it there. So we are flexible with formats. My main reservation with Pandera is seeing it as an internal DE framework only: i.e., it's not readable enough for other teams (if we are sharing Python files), unless we create a yaml wrapper.
Side note... my personal preference is working with anything but pandas dataframes because of the funkiness with data types. When you load local data straight into geopandas, it's fine: it acts like Postgres or GDAL, not changing data types. On the other hand, when it comes to regular dfs, pandas changes data types. For example, if you have an integer column with nulls (say BBLs), pandas converts the values to decimals and replaces the nulls with NaN values. And this behavior persists when you convert a pandas df to a geopandas df (the case with csv files that have geospatial data).
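A tiny repro of the behavior I mean (column names and values are made up), plus the usual workarounds:

```python
import io
import pandas as pd

csv = io.StringIO("bbl,borough\n1000010001,1\n,3\n")

df = pd.read_csv(csv)
print(df["bbl"].dtype)    # float64 -- the null forces integers to decimals
print(df["bbl"].iloc[0])  # 1000010001.0
print(df["bbl"].iloc[1])  # nan

# Workarounds: pandas' nullable integer dtype, or reading everything as strings.
csv.seek(0)
df_int = pd.read_csv(csv, dtype={"bbl": "Int64"})  # keeps integers, missing value becomes <NA>
csv.seek(0)
df_str = pd.read_csv(csv, dtype=str)               # keeps values verbatim as strings
```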
> Would you define data checks in a yaml file, translate them to Pandera pydantic classes and validate?
Yes, exactly. So for example, for COLP I was thinking we'd just parse metadata, and implement custom checks for things like BBLs, or WKBs, etc. I think it'd just take a little glue code. But... it seems like Pandera would most easily facilitate writing our declarative checks in the format of our choice. At a glance, it seems like it's the most lightweight and hackable.
And I feel you with pandas dataframes converting. I've certainly felt that pain, but it mostly goes away when you read in everything as a string. I suppose I'd have concerns in the opposite direction, with potential type coercion happening when importing into a database.
Maybe it makes sense for me to quickly POC what I've described?
If it's a quick POC to do, then yeah, it would be helpful to see!
I'm not sure Pandera is the right tool for dq checks during product builds. It would work for output files, but not for intermediate tables...
Update: geospatial queries work in Soda with Postgres! @damonmcc figured it out :)
I revised the PR with code as seen above.
Next step for me is to explore DuckDB with Soda for local files.
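In the meantime, plain DuckDB against a local parquet file looks roughly like this (file path and check are hypothetical); whether Soda's DuckDB data source can wrap this cleanly is the TBD part:

```python
import duckdb

# DuckDB can query a local parquet file directly; a "check" here is just a
# query that should return zero failing rows.
failing = duckdb.sql(
    """
    SELECT bbl
    FROM 'data/colp.parquet'
    WHERE bbl IS NULL OR length(bbl) <> 10
    """
).fetchall()

assert not failing, f"{len(failing)} rows failed the bbl check"
```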
Related to #650. Creating a separate issue because 1) usage of data contracts can expand beyond data library (i.e. for non-dbt data products) and 2) to provide space for discussion.

Motivation
As a part of the data library revamp (AKA ingest, AKA extract), we would like to introduce data contracts into the data ingestion process to keep our data clean and consistent from the start. With data contracts, we can set clear rules on what data should look like and catch issues such as column name changes before data product builds, making actual builds faster. It will also provide transparency and enable cross-team collaboration, such as embedding business logic to enhance data quality. All in all, it should make everyone's life easier 🤞
Approach
There exist multiple frameworks for data validation. From my research, the main open source tools are Great Expectations and Soda. A smaller one is Pandera.
Thinking about our needs, I came up with the following guidelines to evaluate the tools:
Review of each framework
Pandera
- mypy for typechecking
- pydantic classes to define table schemas and data checks

Great Expectations (GX)
- dbt, can generate self-hosted docs in HTML

Soda
- dask dataframe or DuckDB (geospatial queries not available)
- soda core, which is their free python library and CLI tool
- dask dataframe, a distributed version of pandas. The dask dataframe uses the Presto SQL dialect, which can be different from Postgres and doesn't have geospatial support. It could be better to read local files into DuckDB - TBD how well it works.
- Postgres (public)

Summary
So far, Soda seems to best fit our needs in terms of simplicity, integration with Postgres & local files, custom data checks, and readability of data contracts. I'm leaning towards integration with Postgres for simplicity. It can be easily integrated with current builds: we would need to 1) define yaml file(s) with data checks and Postgres connection info and 2) run a CLI command. Don't love that geospatial checks may not be available.
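Besides the CLI, soda-core also exposes a programmatic Scan API, so the same two artifacts (connection config + checks yaml) could be invoked from a build step. A sketch, with file and data source names assumed:

```python
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("edm_postgres")              # name defined in configuration.yml (assumed)
scan.add_configuration_yaml_file("configuration.yml")  # Postgres connection info
scan.add_sodacl_yaml_file("checks.yml")                # the data checks
scan.execute()

scan.assert_no_checks_fail()  # raise if any check failed
```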
Next steps
Implement the same data checks via Soda & Great Expectations and compare their implementations.
Edit: the write-up was revised/enhanced after the PRs above.