NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Ingest - validation framework #834

Open fvankrieken opened 1 month ago

fvankrieken commented 1 month ago

This issue is a bit of a stub for now.

Somewhat orthogonal to @sf-dcp's work on general data validation, this issue is specifically about validating the new ingest code (and, more specifically, the parquet files it outputs) against dcpy/library so that we can switch over.

The simplest check would be to iterate through datasets, process them locally without pushing to S3, and compare the parquet outputs. However, we know that GDAL's (and therefore library's) parquet outputs differ from pg dumps, so it might also make sense to use our utils to load them into a database and compare there.
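As a rough sketch of what the database-side comparison could look like: once both outputs are loaded as tables with identical columns, a symmetric difference via `EXCEPT` surfaces rows present in only one of them. Everything named here is an assumption for illustration. The helper `table_diff` and the table names `library_out` and `ingest_out` are hypothetical, and the stdlib `sqlite3` module stands in for Postgres so the snippet runs on its own. This is not how dcpy's utils do it.

```python
import sqlite3


def table_diff(conn, left, right):
    """Compare two tables that share the same columns (hypothetical helper).

    Table names are interpolated into the SQL, so only pass trusted names.
    EXCEPT is set-based and ignores duplicate rows, so row counts are
    reported separately to catch differences in duplicates.
    """
    count_left = conn.execute(f'SELECT COUNT(*) FROM "{left}"').fetchone()[0]
    count_right = conn.execute(f'SELECT COUNT(*) FROM "{right}"').fetchone()[0]
    only_left = conn.execute(
        f'SELECT * FROM "{left}" EXCEPT SELECT * FROM "{right}"'
    ).fetchall()
    only_right = conn.execute(
        f'SELECT * FROM "{right}" EXCEPT SELECT * FROM "{left}"'
    ).fetchall()
    return {
        "row_counts": (count_left, count_right),
        "only_in_left": only_left,
        "only_in_right": only_right,
    }


# Toy data standing in for one dataset processed by library and by ingest.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE library_out (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE ingest_out (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO library_out VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany("INSERT INTO ingest_out VALUES (?, ?)", [(1, "a"), (2, "B")])

diff = table_diff(conn, "library_out", "ingest_out")
print(diff)
```

Both result lists being empty, together with equal row counts, would suggest the two outputs match for that dataset. Column types and column order would need a separate check, since this only compares row values.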

Given that, it probably also makes sense to actually run some builds on the new data. It is potentially worth pointing both library and ingest at a new S3 bucket and running some builds off of those datasets to see how things go.