Related to #631.
Motivation
Clean up `to_parquet` fn and its associated tests.

Changes
1) Previously, the `to_parquet` fn took a `recipes.Config` object as its input and parsed it internally. Because the Config object is still evolving (its structure may change, e.g. new attributes) and `to_parquet` only needs its `file.Format` object, I replaced `recipes.Config` with `file.Format`. This simplifies the tests dramatically because we no longer need to build a whole fake `recipes.Config` object to test `to_parquet`.
2) I removed the part of the `to_parquet` fn that downloads data from s3. We already download the data in `run.py` before calling `to_parquet`, so it is already available locally. This change makes `to_parquet` more atomic.
3) Updated test resources. Previously, we created fake `recipes.Config` objects in a test file, and the code was becoming lengthy because of the wide variety of data formats (test cases). To simplify this, I created custom objects in `transform_to_parquet_template.yml`, where each object, i.e. test case, consists of a `file.Format` and a test file name. I feel these tests are easier to maintain long term if all the info about the test cases lives in one place, the yaml file.
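
For reviewers, here is a rough sketch of the shape of changes 1) and 3); it is not the actual code in this PR. `Format` is a hypothetical stand-in for `file.Format`, and its fields, the `to_parquet` signature, the stub body, and the test-case keys and file names are all assumptions made for illustration. The list literal stands in for what loading `transform_to_parquet_template.yml` might produce.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Format:
    """Hypothetical stand-in for file.Format; the real attributes may differ."""
    file_type: str                    # e.g. "csv" or "json" (assumed field)
    delimiter: Optional[str] = None   # assumed field


# Assumed shape of the parsed yaml file: one mapping per test case, each
# holding a format description and a test file name (all values invented).
TEST_CASES = [
    {"format": {"file_type": "csv", "delimiter": ","}, "file_name": "sample.csv"},
    {"format": {"file_type": "csv", "delimiter": "|"}, "file_name": "sample_pipe.csv"},
    {"format": {"file_type": "json"}, "file_name": "sample.json"},
]


def to_parquet(local_path: Path, fmt: Format) -> Path:
    """Stub with the narrowed inputs: a local file and a Format, no Config, no s3.

    The real function converts the data; this stub only derives an output path.
    """
    if fmt.file_type not in {"csv", "json"}:
        raise ValueError(f"unsupported format: {fmt.file_type}")
    return local_path.with_suffix(".parquet")


def build_cases(raw_cases):
    """Turn each raw test-case mapping into a (Format, file name) pair."""
    return [(Format(**case["format"]), case["file_name"]) for case in raw_cases]


# Each test case needs only a Format and a file name, not a full Config.
for fmt, file_name in build_cases(TEST_CASES):
    out = to_parquet(Path("resources") / file_name, fmt)
    print(fmt.file_type, "->", out.name)
```

With this layout, adding a new data format to the tests means adding one entry to the yaml file rather than constructing another fake `recipes.Config` in Python.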