Closed fvankrieken closed 1 month ago
My thoughts about location of preprocessing functions... These functions (like one dropping columns) potentially can be used in python pipelines too , and utils
appears to be the right place for utils shared across modules in dcpy
. And then you can import these functions into your Preprocessors
class. For the class itself, it seems to fit with other classes in models
module.
Otherwise, the framework around preprocessing steps looks good to me!
My thoughts about location of preprocessing functions... These functions (like one dropping columns) potentially can be used in python pipelines too , and
utils
appears to be the right place for utils shared across modules indcpy
. And then you can import these functions into yourPreprocessors
class. For the class itself, it seems to fit with other classes inmodels
module.Otherwise, the framework around preprocessing steps looks good to me!
Yeah that's going to be a funny thing moving forwards, and I think is going to be fairly dependent on what utils we actually create to meet the needs of these datasets. For some, I don't think we need a util - dropping columns on a pd dataframe is already a one-liner, so a wrapper around it in our utils would be a bit odd. But could definitely see a world where we end up putting utils there that get referenced here.
632
In essence, this pr is really about the framework - the classes in
dcpy.models.lifecycle.ingest
and the twovalidate
functions indcpy.lifecycle.ingest.transform
. The idea is that processing steps can be defined in a yml template like soFor now, functions take and return a dataframe, and any other needed arguments. We're obviously not set on this, and maybe for now just to be completely agnostic, each preprocessing step should just read from and write to a parquet file. But for a quick framework, this seemed easiest. I mainly wanted to focus on validating yml input to python function calls.
Would love input on the structure of where preprocessing functions live. Another module would probably be cleaner than this rather than this weird class, but this worked for now