catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

Update extract and transform functions to use Pydantic Settings #1459

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

We've created Pydantic Settings classes that contain the information required to run the ETL, but all of the lower level extract and transform functions still take lists of individual years, tables, etc. as their arguments, instead of passing the settings object around.

For example in pudl.etl._etl_eia() this code:

    eia860_tables = eia_settings.eia860.tables
    eia860_years = eia_settings.eia860.years
    eia860m = eia_settings.eia860.eia860m
    eia923_tables = eia_settings.eia923.tables
    eia923_years = eia_settings.eia923.years

    eia923_raw_dfs = pudl.extract.eia923.Extractor(ds).extract(year=eia923_years)
    eia860_raw_dfs = pudl.extract.eia860.Extractor(ds).extract(year=eia860_years)
    if eia860m:
        eia860m_raw_dfs = pudl.extract.eia860m.Extractor(ds).extract(
            year_month=eia_settings.eia860.eia860m_date
        )
        eia860_raw_dfs = pudl.extract.eia860m.append_eia860m(
            eia860_raw_dfs=eia860_raw_dfs,
            eia860m_raw_dfs=eia860m_raw_dfs
        )

    eia860_transformed_dfs = pudl.transform.eia860.transform(
        eia860_raw_dfs,
        eia860_tables=eia860_tables
    )
    eia923_transformed_dfs = pudl.transform.eia923.transform(
        eia923_raw_dfs,
        eia923_tables=eia923_tables
    )

Could become:

eia923_raw_dfs = pudl.extract.eia923.Extractor(ds).extract(eia_settings.eia923)
eia860_raw_dfs = pudl.extract.eia860.Extractor(ds).extract(eia_settings.eia860)
if eia_settings.eia860.eia860m:
    eia860m_raw_dfs = pudl.extract.eia860m.Extractor(ds).extract(eia860_settings.eia860)
    eia860_raw_dfs = pudl.extract.eia860m.append_eia860m(
            eia860_raw_dfs=eia860_raw_dfs,
            eia860m_raw_dfs=eia860m_raw_dfs,
    )
eia860_transformed_dfs = pudl.transform.eia860.transform(eia860_raw_dfs, settings=eia_settings.eia860)
eia923_transformed_dfs = pudl.transform.eia923.transform(eia923_raw_dfs, settings=eia_settings.eia923)

Using the Settings classes natively would better encapsulate this information and mean that all these intermediary steps don't need to know the details of how to parse the individual settings. It would also mean that if/when the contents of the Settings object needs to change, it can, and the destination for that information (the lower level extract or transform functions) will still have access to all of the information they need, without our needing to change any of these intermediary functions to pass additional parameters throughout.

Similar refactors to natively use the new Settings should also happen for the ferc1 and other datasets.

zaneselvans commented 2 years ago

@bendnorman @zschira Curious what you think about this. To me it seems weird to build these nice self-contained Settings objects only to then disembowel them one step into the ETL.

zschira commented 2 years ago

I think it makes sense to keep the settings objects in-tact throughout the pipeline. How do we want to handle datasets using the excel GenericExtractor though? We could still pass the settings objects in and pull the working partitions out, but I wasn't sure how valuable it was in that case where you'd have to take a generic settings object anyway.