NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Ingest: Clean up & Simplify Tests #786

Closed sf-dcp closed 3 weeks ago

sf-dcp commented 3 weeks ago

Related to #631.

Motivation

Clean up to_parquet fn and its associated tests.

Changes

1) Previously, the to_parquet fn took a recipes.Config object as its input and parsed it internally. Because the Config object is still evolving (its structure may change, e.g. new attributes) and we only need its file.Format object, I replaced recipes.Config with file.Format. This simplifies the tests dramatically because we no longer need to build a whole fake recipes.Config object to test to_parquet.

2) I removed the part of the to_parquet fn that downloads data from S3. We already download the data in run.py before we call to_parquet, so it is already available locally. This change makes to_parquet more atomic.

3) Updated test resources. Previously, we were creating fake recipes.Config objects in a test file, and the code was becoming lengthy because of the wide variety of data formats (test cases). To make it simpler, I created custom objects in transform_to_parquet_template.yml, where each object (i.e. test case) consists of a file.Format and a test file name. I feel it's easier to maintain these tests long term if all the info about the test cases lives in one place, the YAML file.
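
For readers who want a concrete picture of the three changes, here is a rough sketch, not the actual code in this PR. CsvFormat, ParquetCase, RAW_CASES, load_cases, and parquet_output_path are all hypothetical names, and the real file.Format fields and the real layout of transform_to_parquet_template.yml may differ. RAW_CASES stands in for what the template might look like once parsed, and the stand-in function only derives an output path instead of actually writing parquet.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CsvFormat:
    """Hypothetical stand-in for file.Format; the real fields may differ."""
    type: str
    delimiter: str = ","


@dataclass(frozen=True)
class ParquetCase:
    """One test case: a format plus the name of a local test file."""
    file_name: str
    format: CsvFormat


# Assumed shape of the template after YAML parsing (names are illustrative).
RAW_CASES = [
    {"file_name": "test.csv", "format": {"type": "csv"}},
    {"file_name": "test_pipe.csv", "format": {"type": "csv", "delimiter": "|"}},
]


def load_cases(raw_cases):
    """Turn parsed template entries into typed test cases."""
    return [
        ParquetCase(file_name=c["file_name"], format=CsvFormat(**c["format"]))
        for c in raw_cases
    ]


def parquet_output_path(file_format, local_data_path, output_dir):
    """Stand-in for the path logic of a to_parquet-like function.

    It receives only a format and a local path: no Config object and no S3
    download. Reading the data and writing parquet are omitted here.
    """
    if file_format.type != "csv":
        raise ValueError(f"unsupported format: {file_format.type}")
    return Path(output_dir) / (Path(local_data_path).stem + ".parquet")


cases = load_cases(RAW_CASES)
outputs = [parquet_output_path(c.format, c.file_name, "out") for c in cases]
```

With this shape, adding a test case means adding one entry to the template rather than constructing another fake Config in the test file.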

fvankrieken commented 3 weeks ago

Love it! Have one comment about a few lines that maybe should be cut but other than that this is great, everything is much cleaner with these tweaks