catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Convert original EIA data tables to Dagster assets #2434

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

For each of the tables below:

This issue should replace all of the code in pudl.output.eia923 and pudl.output.eia860

boiler_fuel_eia923

fuel_receipts_costs_eia923

generation_eia923

generation_fuel_eia923

- [x] `ownership_eia860` (denorm only)
- [x] test & replace `ownership_eia860` in PudlTabl
- [x] `generation_eia923` (denorm + aggs)
- [x] `boiler_fuel_eia923` (denorm + aggs)
- [x] `fuel_receipts_costs_eia923` (denorm + aggs)
- [x] `generation_fuel_combined_eia923` (denorm + aggs)
- [x] Replace `denorm_plants_utilities_ferc1` with a Python asset, since the SQL View is causing issues.
- [x] Add logic to `PudlTabl` that allows it to get the correct table based on `self.freq`
- [x] Add new `boiler_fuel_eia923` tables to the DB
- [x] test & replace `boiler_fuel_eia923` in PudlTabl
- [x] Add new `fuel_receipts_costs_eia923` tables to the DB
- [x] test & replace `fuel_receipts_costs_eia923` in PudlTabl
- [x] Add new `generation_eia923` tables to the DB
- [x] test **basic** `generation_eia923` in PudlTabl
- [x] Add new `generation_fuel_combined_eia923` tables to the DB
- [x] test **basic** `generation_fuel_combined_eia923` in PudlTabl
- [x] Search for `TODO` in `pudl.output.eia923` and move those tweaks to the ETL
- [x] Move `assign_unit_ids` infrastructure from `pudl.output.eia860` into a `pudl.analysis` module
- [x] Attempt to remove `pudl.output.eia860`
- [x] Attempt to remove `pudl.output.eia923`

The more complex allocated and aggregated generation_fuel_eia923 assets are part of #2435

zaneselvans commented 1 year ago

Should we have separate denormalized versions of the gen fuel and nuke gen fuel tables? Or should they just be available combined?

zaneselvans commented 1 year ago

This issue is quite entangled with #2433

zaneselvans commented 1 year ago

@bendnorman It seems like pandas and SQLAlchemy don’t treat database views quite like tables. This is a problem since we rely on uniform table-like behavior in many places.

For example:

pd.read_sql("denorm_plants_utilities_ferc1", pudl_engine)  # fails.
pd.read_sql_table("denorm_plants_utilities_ferc1", pudl_engine)  # fails.
pd.read_sql_query("SELECT * FROM denorm_plants_utilities_ferc1", pudl_engine)  # succeeds.

Also:

md = sa.MetaData()
md.reflect(pudl_engine)
"denorm_plants_utilities_ferc1" in sorted(md.tables.keys())  # False

I wonder if this would all work if we did CREATE TABLE rather than CREATE VIEW?

Not sure what the right approach is. Maybe we shouldn't be trying to do any of this in SQL?

zaneselvans commented 1 year ago

Basic Denormalized EIA-923 assets

Questions:

More complex analytical outputs:

Generation Fuel:

There are several aggregated & allocated versions of generation fuel data:

Generation:

Other thoughts

PudlTabl integration considerations:

zaneselvans commented 1 year ago

We have lots of tables that need to get aggregated at either annual or monthly granularity. These tables are often related to each other by dependencies. They form a sub-graph of the larger DAG. If we want to output two sets of the same tables, with one aggregated monthly, and the other annually, the exact same graph ops should be able to do both, getting run twice (with freq="MS" and freq="AS").

It seems like each of the sets of frequency specific outputs could be represented as a graph-backed multi-asset. So maybe what we need here is a graph-backed multi-asset factory, that takes freq as a parameter?

bendnorman commented 1 year ago

I think a graph backed multi asset factory is a good idea. What are some examples of these sub-graphs? How complex are they?

zaneselvans commented 1 year ago

The main sub-graph I'm working on right now is the several different generation_fuel variants that are used in the net generation allocation process, nearly all of which can run at monthly or yearly frequency. These include all the EIA-923 tables except fuel_receipts_costs_eia923!

Plus the annual-only generators_eia860 and boiler_generator_assn_eia860 for good measure!

bendnorman commented 1 year ago

Instead of creating a graph-backed multi-asset, we might be able to create a factory that returns multiple assets that represent one of these sub-graphs. Something like this:

from dagster import asset, Definitions, AssetIn
import pandas as pd

def asset_graph_factory(freq):
    @asset(name=f"a_{freq}")
    def a():
        # aggregate based on freq
        return pd.DataFrame()

    @asset(name=f"b_{freq}", ins={f"a_{freq}": AssetIn()})
    def b(**ins):
        # aggregate based on freq
        return pd.DataFrame()

    @asset(name=f"c_{freq}", ins={f"b_{freq}": AssetIn()})
    def c(**ins):
        # aggregate based on freq
        return pd.DataFrame()

    return (a, b, c)

defs = Definitions(
    assets=[
        *asset_graph_factory("ms"),
        *asset_graph_factory("as"),
        *asset_graph_factory("all"),
    ],
)

image

zaneselvans commented 1 year ago

Notes from design discussion with @bendnorman @cmgosnell @jdangerx:

zaneselvans commented 1 year ago

@jdangerx I’m trying to implement the asset factory pattern that @bendnorman suggested but I think I am misunderstanding something about how to return several assets simultaneously from the factory function…

If I do:

def generation_fuel_agg_eia923_asset_factory(
    freq: Literal["AS", "MS"],
    io_manager_key: str | None = None,
) -> tuple:
    ...

generation_fuel_agg_eia923_assets = [
    generation_fuel_agg_eia923_asset_factory(freq=freq) for freq in AGG_FREQS
]

then the assets aren't picked up. But I also can't unpack the returned tuple with the * operator like Ben did in his example (It's a syntax error):

  File "/Users/zane/code/catalyst/pudl/src/pudl/output/new_eia923.py", line 130
    generation_fuel_monthly_eia923_assets = *generation_fuel_agg_eia923_asset_factory(

SyntaxError: can't use starred expression here
zaneselvans commented 1 year ago

Hmm. If I have it return a list instead of a tuple and then call each frequency separately, rather than trying to do a list comprehension it seems to work:

generation_fuel_monthly_eia923_assets = generation_fuel_agg_eia923_asset_factory(
    freq="MS", io_manager_key=None
)
generation_fuel_yearly_eia923_assets = generation_fuel_agg_eia923_asset_factory(
    freq="AS", io_manager_key=None
)

Not sure why I can't just do:

generation_fuel_agg_eia923_assets = [
    ass for ass in
    generation_fuel_agg_eia923_asset_factory(freq=freq, io_manager_key=None)
    for freq in AGG_FREQS
]
bendnorman commented 1 year ago

Can you do:

generation_fuel_agg_eia923_assets = [
    ass for freq in AGG_FREQS
    for ass in generation_fuel_agg_eia923_asset_factory(freq=freq, io_manager_key=None)
]

to flatten the list of assets?

e-belfer commented 1 year ago

Closed in #2519