catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Add multi_asset output names to metadata #2086

Closed bendnorman closed 1 year ago

bendnorman commented 1 year ago

multi_assets require you to specify the outputs in the multi_asset decorator. Some of our multi_assets output dozens of assets, so we'll need a way to look up table names by multi_asset name.

We could add a "multi_asset" field to our resource metadata so we can construct multi_asset outputs. Something like this:

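from dagster import AssetIn, AssetOut, load_assets_from_modules, multi_asset

# eia860, eia923, and Package are assumed to be imported from pudl.
eia_assets = load_assets_from_modules([eia860, eia923])

@multi_asset(
    ins={
        asset_key.to_python_identifier(): AssetIn()
        for eia_asset in eia_assets
        for asset_key in eia_asset.asset_keys
    },
    # Package.from_multi_asset() is the proposed lookup; it does not exist yet.
    outs={
        resource.name: AssetOut()
        for resource in Package.from_multi_asset("eia_transform").resources
    },
)
def eia_transform(**eia_transformed_dfs):
    ...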
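
For illustration, the metadata lookup behind that decorator could look roughly like the sketch below. It is a Dagster-free stand-in, not the PUDL implementation: the `Resource` class, the `RESOURCES` list, the `table_names_for` helper, and the table entries are all hypothetical, and only the idea of tagging each resource with a "multi_asset" field comes from the proposal above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for one table's metadata entry."""

    name: str
    multi_asset: str  # the proposed new metadata field


# Illustrative entries only; this is not the real PUDL metadata.
RESOURCES = [
    Resource("plants_eia860", "eia_transform"),
    Resource("generators_eia860", "eia_transform"),
    Resource("boiler_fuel_eia923", "eia_transform"),
    Resource("plants_ferc1", "ferc1_transform"),
]


def table_names_for(multi_asset_name: str) -> list[str]:
    """Return the names of all tables tagged with the given multi_asset."""
    names = [r.name for r in RESOURCES if r.multi_asset == multi_asset_name]
    if not names:
        # Fail loudly so a typo in the multi_asset name yields no silent empty outs.
        raise KeyError(f"No resources tagged with multi_asset={multi_asset_name!r}")
    return names


eia_outputs = table_names_for("eia_transform")
```

The resulting list of names is what a dict comprehension over `AssetOut()` would consume. Raising on an unknown name is a design choice for this sketch: an empty `outs` mapping would otherwise only surface as a confusing error later.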
bendnorman commented 1 year ago

I implemented the Package.get_etl_group_tables() method, which returns the table names for a given ETL group. ETL groups mostly align with multi_assets. The raw EIA tables have not been added to the metadata, so their table names are listed manually in the pudl.extract.eia module. Also, the raw and intermediate EIA assets currently use the fs_io_manager. We'll need to add these tables to the metadata before switching them to the SQLiteIOManager so we can preserve dtypes.
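
To make the dtype concern concrete, here is a small standard-library sketch of why an IO manager backed by SQLite needs per-table metadata. It does not show how the SQLiteIOManager works; `TABLE_DTYPES`, `CASTS`, `restore_dtypes`, and the `plants` table are hypothetical names used only for illustration. SQLite hands back booleans as integers and dates as text, so the original types can only be recovered if they are recorded somewhere else.

```python
import sqlite3
from datetime import date

# Hypothetical per-table metadata: column name -> intended Python type.
TABLE_DTYPES = {"plant_id": int, "report_date": date, "is_active": bool}

# How to rebuild each intended type from the value SQLite returns.
CASTS = {int: int, bool: bool, date: date.fromisoformat}


def restore_dtypes(row: dict, dtypes: dict) -> dict:
    """Cast each value back to the type recorded in the metadata."""
    return {
        col: None if val is None else CASTS[dtypes[col]](val)
        for col, val in row.items()
    }


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE plants (plant_id INTEGER, report_date TEXT, is_active INTEGER)"
)
conn.execute(
    "INSERT INTO plants VALUES (?, ?, ?)",
    (1, date(2020, 1, 1).isoformat(), True),
)

# Without metadata, the bool comes back as an int and the date as a string.
raw = dict(conn.execute("SELECT * FROM plants").fetchone())

# With metadata, the original types can be restored.
restored = restore_dtypes(raw, TABLE_DTYPES)
conn.close()
```

Tables that are missing from the metadata would have no entry to cast against, which is the gap described above for the raw and intermediate EIA assets.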