Closed zaneselvans closed 2 years ago
@bendnorman @zschira Curious what you think about this. To me it seems weird to build these nice self-contained Settings objects only to then disembowel them one step into the ETL.
I think it makes sense to keep the settings objects intact throughout the pipeline. How do we want to handle datasets using the Excel `GenericExtractor`, though? We could still pass the settings objects in and pull the working partitions out, but I wasn't sure how valuable that is, since the extractor has to accept a generic settings object anyway.
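One way to picture the `GenericExtractor` case: the dataset-specific settings object exposes its working partitions in a generic form, and the caller unpacks them at the boundary. This is only a hedged sketch — the class names, the `working_partitions` property, and the `extract(**partitions)` signature here are illustrative stand-ins, not PUDL's actual API, and a plain dataclass stands in for the Pydantic model:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GenericDatasetSettings:
    """Stand-in for a per-dataset Pydantic settings model (hypothetical)."""
    years: List[int] = field(default_factory=lambda: [2018, 2019, 2020])

    @property
    def working_partitions(self) -> Dict[str, List[int]]:
        # Expose the partitions in the generic mapping form the extractor expects.
        return {"years": self.years}

class GenericExtractor:
    """Simplified stand-in for an Excel GenericExtractor (illustrative)."""
    def extract(self, **partitions) -> Dict[str, List[int]]:
        # The real extractor would iterate over partition values to load files;
        # here we just echo the partitions back, sorted.
        return {key: sorted(values) for key, values in partitions.items()}

# The extractor stays generic: the caller pulls partitions out of the
# settings object, so GenericExtractor never sees dataset-specific settings.
settings = GenericDatasetSettings(years=[2020, 2019])
dfs = GenericExtractor().extract(**settings.working_partitions)
```

The trade-off is as described above: the extractor itself stays settings-agnostic, and the unpacking happens once at the call site rather than inside the extractor.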
We've created Pydantic `Settings` classes that contain the information required to run the ETL, but all of the lower-level extract and transform functions still take lists of individual years, tables, etc. as their arguments, instead of passing the settings object around. For example, the code in `pudl.etl._etl_eia()` that unpacks individual years and tables from the settings object could instead pass the settings object through to the extract and transform steps directly.
Using the `Settings` classes natively would better encapsulate this information and mean that all of these intermediary steps don't need to know the details of how to parse the individual settings. It would also mean that if and when the contents of a `Settings` object need to change, they can, and the destination for that information (the lower-level extract or transform functions) will still have access to everything it needs, without our having to change any of the intermediary functions to pass additional parameters through.

Similar refactors to natively use the new `Settings` classes should also happen for `ferc1` and the other datasets.