NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Ingest - validation framework #834

Open fvankrieken opened 6 months ago

fvankrieken commented 6 months ago

This issue is a bit of a stub for now.

This is a bit orthogonal to @sf-dcp's work around general data validation. Here I specifically mean validating the new ingest code (and more specifically, the parquet files it outputs) against dcpy/library, so that we can switch over from library to ingest.

The simplest check would be iterating through datasets, processing each locally without pushing to S3, and comparing the parquet outputs. However, we know that GDAL's (and therefore library's) parquet outputs differ from pg dumps, so it might also make sense to use our utils to load into a database and compare there.
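
A rough sketch of what the record-level comparison could look like, once both outputs are loaded into plain Python records. Everything named here (`TableDiff`, `compare_records`) is hypothetical and not part of dcpy. It assumes each parquet file has already been read into a list of dicts, for example with `pyarrow.parquet.read_table(path).to_pylist()`. Values are compared by their `repr`, which is a deliberately strict choice: a type difference between library and ingest (say `1` vs `"1"`) shows up as a mismatch rather than being silently coerced.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TableDiff:
    """Hypothetical container for the differences between two outputs."""

    columns_only_in_library: set[str] = field(default_factory=set)
    columns_only_in_ingest: set[str] = field(default_factory=set)
    # Rows are stored as tuples of repr() strings over the shared columns.
    rows_only_in_library: list[tuple[str, ...]] = field(default_factory=list)
    rows_only_in_ingest: list[tuple[str, ...]] = field(default_factory=list)

    @property
    def matches(self) -> bool:
        return not (
            self.columns_only_in_library
            or self.columns_only_in_ingest
            or self.rows_only_in_library
            or self.rows_only_in_ingest
        )


def _columns(records: list[dict]) -> set[str]:
    # Union of keys across all records; empty input gives an empty set.
    return set().union(*(r.keys() for r in records))


def _row_key(record: dict, columns: list[str]) -> tuple[str, ...]:
    # repr() makes every value hashable and keeps 1 and "1" distinct.
    return tuple(repr(record.get(c)) for c in columns)


def compare_records(library_rows: list[dict], ingest_rows: list[dict]) -> TableDiff:
    """Compare two record sets, ignoring row order but respecting duplicates."""
    lib_cols = _columns(library_rows)
    ing_cols = _columns(ingest_rows)
    shared = sorted(lib_cols & ing_cols)

    # Counters act as multisets, so duplicated rows are counted, not collapsed.
    lib_counts = Counter(_row_key(r, shared) for r in library_rows)
    ing_counts = Counter(_row_key(r, shared) for r in ingest_rows)

    return TableDiff(
        columns_only_in_library=lib_cols - ing_cols,
        columns_only_in_ingest=ing_cols - lib_cols,
        rows_only_in_library=list((lib_counts - ing_counts).elements()),
        rows_only_in_ingest=list((ing_counts - lib_counts).elements()),
    )
```

Rows are compared only on the columns both outputs share, so an extra column on one side is reported once as a column difference instead of making every row mismatch. Geometry columns would likely need separate handling, since byte-for-byte equality is probably too strict there.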

Given that, it probably also makes sense to actually run some builds based on the new data. It is potentially worth aiming both library and ingest at a new S3 bucket and running some builds off of those datasets to see how things go.

fvankrieken commented 6 days ago

To be closed when https://github.com/NYCPlanning/data-engineering/pull/1191 is closed AND the validation process is documented in the wiki.