NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team

Scaffold for preprocessing steps #763

Closed fvankrieken closed 1 month ago

fvankrieken commented 1 month ago


In essence, this PR is really about the framework - the classes in `dcpy.models.lifecycle.ingest` and the two validate functions in `dcpy.lifecycle.ingest.transform`. The idea is that processing steps can be defined in a yml template like so:

name: bpl_libraries
...
processing_steps:
- name: simple_transformation
- name: drop_columns
  args:
    columns: ['col1', 'col2']

For now, functions take and return a dataframe, plus any other needed arguments. We're obviously not set on this - maybe, to stay completely agnostic, each preprocessing step should just read from and write to a parquet file - but for a quick framework, this seemed easiest. I mainly wanted to focus on validating yml input to python function calls.
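For concreteness, here's a minimal sketch of what that validation could look like - `ProcessingStep`, `Preprocessors`, and `validate_processing_steps` are illustrative stand-ins, not the actual names in this PR:

```python
# Sketch only: validate yml-defined steps against python function signatures.
import inspect
from typing import Any

import pandas as pd
from pydantic import BaseModel


class ProcessingStep(BaseModel):
    """One entry under `processing_steps` in the template."""

    name: str
    args: dict[str, Any] = {}


class Preprocessors:
    @staticmethod
    def drop_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
        return df.drop(columns=columns)


def validate_processing_steps(steps: list[ProcessingStep]) -> None:
    """Fail early if a step names an unknown function or passes bad args."""
    for step in steps:
        func = getattr(Preprocessors, step.name, None)
        if func is None:
            raise ValueError(f"Unknown processing step '{step.name}'")
        # bind() raises TypeError if the yml args don't match the signature
        inspect.signature(func).bind(pd.DataFrame(), **step.args)
```

With something like that in place, actually running a validated step reduces to `getattr(Preprocessors, step.name)(df, **step.args)`.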

Would love input on the structure of where preprocessing functions live. Another module would probably be cleaner than this rather weird class, but this worked for now

sf-dcp commented 1 month ago

My thoughts about the location of preprocessing functions... These functions (like the one dropping columns) could potentially be used in python pipelines too, and `utils` appears to be the right place for helpers shared across modules in `dcpy`. You could then import these functions into your `Preprocessors` class. The class itself seems to fit with the other classes in the `models` module.
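Something like this rough layout, say - the `dcpy.utils` module path here is hypothetical:

```python
# dcpy/utils/data_cleaning.py (hypothetical path) - a shared helper that
# python pipelines could import directly
import pandas as pd


def drop_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    return df.drop(columns=columns)


# the Preprocessors class then just re-exports the shared helper, so the
# yml-driven name lookup keeps working unchanged
class Preprocessors:
    drop_columns = staticmethod(drop_columns)
```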

Otherwise, the framework around preprocessing steps looks good to me!

fvankrieken commented 1 month ago

> My thoughts about the location of preprocessing functions... These functions (like the one dropping columns) could potentially be used in python pipelines too, and `utils` appears to be the right place for helpers shared across modules in `dcpy`. You could then import these functions into your `Preprocessors` class. The class itself seems to fit with the other classes in the `models` module.
>
> Otherwise, the framework around preprocessing steps looks good to me!

Yeah, that's going to be a funny thing moving forward, and I think it will depend fairly heavily on what utils we actually create to meet the needs of these datasets. For some, I don't think we need a util at all - dropping columns on a pd dataframe is already a one-liner (see below), so a wrapper around it in our utils would be a bit odd. But I could definitely see a world where we end up putting utils there that get referenced here.
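For reference, the one-liner in question, on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})
df = df.drop(columns=["col1", "col2"])  # the entire "transformation"
```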