Convert original EIA data tables to Dagster assets

zaneselvans commented 1 year ago

For each of the tables below:

Create denormalized dagster assets (including entity names + IDs, other calculations we were doing in outputs)
For the ones that need to be aggregated for yearly / monthly output, create aggregated assets from the denormalized outputs.
Compile resource metadata for the tables that are going to go into the database.
Rewire the PudlTabl output object to refer to the database rather than doing its own calculations.

This issue should replace all of the code in `pudl.output.eia923` and `pudl.output.eia860`.

✅ `boiler_fuel_eia923`

✅ `fuel_receipts_costs_eia923`

- [x] `denorm_fuel_receipts_costs_eia923` has all of `state`, `mine_state` and `plant_state`, and different numbers of all of them, which seems wrong.
- [x] `denorm_fuel_receipts_costs_eia923` has a bunch of extra `_rollfilled` columns that should have been dropped by the rolling average function... and these show up in both the existing and dagster outputs! Actually these columns never should have been added. The one data column (`fuel_cost_per_mmbtu`) that's actually getting filled is fine -- and its `_rollfilled` doppelganger is dropped correctly. It's all these extra columns that are hanging around.

✅ `generation_eia923`

- [x] Test the original (not filled) generation outputs and replace in PudlTabl

✅ `generation_fuel_eia923`

- [x] Nuclear gen_fuel records have infinite heat per unit because they don't report fuel units. Need to port the solution from `pudl.output.eia923.generation_fuel_all_eia923()`. This is duplicative and we should dedupe it, but just trying to make it work for the moment.

- [x] `ownership_eia860` (denorm only)
- [x] test & replace `ownership_eia860` in PudlTabl
- [x] `generation_eia923` (denorm + aggs)
- [x] `boiler_fuel_eia923` (denorm + aggs)
- [x] `fuel_receipts_costs_eia923` (denorm + aggs)
- [x] `generation_fuel_combined_eia923` (denorm + aggs)
- [x] Replace `denorm_plants_utilities_ferc1` with a Python asset, since the SQL View is causing issues.
- [x] Add logic to `PudlTabl` that allows it to get the correct table based on `self.freq`
- [x] Add new `boiler_fuel_eia923` tables to the DB
- [x] test & replace `boiler_fuel_eia923` in PudlTabl
- [x] Add new `fuel_receipts_costs_eia923` tables to the DB
- [x] test & replace `fuel_receipts_costs_eia923` in PudlTabl
- [x] Add new `generation_eia923` tables to the DB
- [x] test **basic** `generation_eia923` in PudlTabl
- [x] Add new `generation_fuel_combined_eia923` tables to the DB
- [x] test **basic** `generation_fuel_combined_eia923` in PudlTabl
- [x] Search for `TODO` in `pudl.output.eia923` and move those tweaks to the ETL
- [x] Move `assign_unit_ids` infrastructure from `pudl.output.eia860` into a `pudl.analysis` module
- [x] Attempt to remove `pudl.output.eia860`
- [x] Attempt to remove `pudl.output.eia923`

The more complex allocated and aggregated generation_fuel_eia923 assets are part of #2435

zaneselvans commented 1 year ago

Should we have separate denormalized versions of the gen fuel and nuke gen fuel tables? Or should they just be available combined?

zaneselvans commented 1 year ago

This issue is quite entangled with #2433

zaneselvans commented 1 year ago

@bendnorman It seems like pandas and SQLAlchemy don’t treat database views quite like tables. This is a problem since we rely on uniform table-like behavior in many places.

For example:

pd.read_sql("denorm_plants_utilities_ferc1", pudl_engine)  # fails.
pd.read_sql_table("denorm_plants_utilities_ferc1", pudl_engine)  # fails.
pd.read_sql_query("SELECT * FROM denorm_plants_utilities_ferc1", pudl_engine)  # succeeds.

Also:

md = sa.MetaData()
md.reflect(pudl_engine)
"denorm_plants_utilities_ferc1" in sorted(md.tables.keys())  # False

I wonder if this would all work if we did CREATE TABLE rather than CREATE VIEW?

Not sure what the right approach is. Maybe we shouldn't be trying to do any of this in SQL?
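
One possible explanation for the reflection result above: SQLite records views separately from tables in its schema catalog, and SQLAlchemy's `MetaData.reflect()` takes a `views` argument that defaults to `False`, so views are skipped unless it is set to `True`. A minimal sketch of the underlying distinction, using only the standard-library `sqlite3` module. The `plants` table and `denorm_plants` view are made-up stand-ins, not real PUDL objects:

```python
import sqlite3

# In-memory database with one real table and one view on top of it.
# Both names are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plants (plant_id INTEGER PRIMARY KEY, plant_name TEXT)")
conn.execute("INSERT INTO plants VALUES (1, 'Example Plant')")
conn.execute("CREATE VIEW denorm_plants AS SELECT plant_id, plant_name FROM plants")

# SQLite's catalog tags each object with its type, so tools that only
# list objects of type 'table' will not see the view.
catalog = dict(
    conn.execute("SELECT name, type FROM sqlite_master ORDER BY name").fetchall()
)

# A plain SELECT against the view still works like a SELECT against a table.
rows = conn.execute("SELECT * FROM denorm_plants").fetchall()
```

This matches the behavior described above, where an explicit `SELECT` succeeds but table-oriented helpers do not find the view.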

zaneselvans commented 1 year ago

Basic Denormalized EIA-923 assets

denorm_generation_eia923 (temporal aggregation)
- denorm_generation_eia923
- denorm_generation_as_eia923
- denorm_generation_ms_eia923
denorm_fuel_receipts_costs_eia923 (temporal aggregation + merge in coalmines)
- denorm_fuel_receipts_costs_eia923
- denorm_fuel_receipts_costs_as_eia923
- denorm_fuel_receipts_costs_ms_eia923
denorm_boiler_fuel_eia923 (temporal aggregation)
- denorm_boiler_fuel_eia923
- denorm_boiler_fuel_as_eia923
- denorm_boiler_fuel_ms_eia923
denorm_generation_fuel_eia923 (temporal aggregation + nuke vs. non-nuke vs. both)
- denorm_generation_fuel_nuclear_eia923
- denorm_generation_fuel_nonuclear_eia923
- denorm_generation_fuel_all_eia923
- denorm_generation_fuel_nuclear_as_eia923
- denorm_generation_fuel_nonuclear_as_eia923
- denorm_generation_fuel_all_as_eia923
- denorm_generation_fuel_nuclear_ms_eia923
- denorm_generation_fuel_nonuclear_ms_eia923
- denorm_generation_fuel_all_ms_eia923
- Nine tables is ridiculous. We could create a normalized but un-aggregated generation_fuel_all_eia923 table first, and then provide the 3 denormalized frequency versions (with freq=None, AS, MS).
- Could also hold off on denormalization in gen

Questions:

Should we separate the yearly/monthly aggregation from the denormalization?
Should aggregated but not denormalized tables be stored in the DB or as interim assets written to disk?
Should the denormalized output tables just be the tables we want folks to use? And should that just be a small proportion of the overall collection of tables? Should we be focusing on getting all the

More complex analytical outputs:

Generation Fuel:

There are several aggregated & allocated versions of generation fuel data:

gen_fuel_by_generator_eia923
gen_fuel_by_generator_energy_source_eia923
gen_fuel_by_generator_energy_source_owner_eia923

Generation:

Needs to be able to use allocated net generation from generation fuel table. But this is a switch that controls which table is read, not how the table is generated (within the PudlTabl object) so this isn't too complex.

Other thoughts

Aggregations can be separated from denormalization.
Aggregations should only be performed on the normalized tables.
Denormalization process is identical on raw, monthly, or annual aggregates.
Aggregation asset factory specific to each output table.
Hand off aggregated table to a separate denormalization step when appropriate
Denormalization seems more uniform across tables than aggregation
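
A rough sketch of the separation described in these notes, in plain Python rather than pandas or Dagster: aggregation runs on the normalized records only, and one denormalization step is reused for every frequency. The record layout, the `aggregate` and `denormalize` helpers, and the plant names are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical normalized monthly records.
generation = [
    {"plant_id": 1, "report_date": "2020-01-01", "net_generation_mwh": 10.0},
    {"plant_id": 1, "report_date": "2020-02-01", "net_generation_mwh": 15.0},
    {"plant_id": 2, "report_date": "2020-01-01", "net_generation_mwh": 7.0},
]
# Hypothetical entity table used for denormalization.
plant_names = {1: "Plant A", 2: "Plant B"}


def aggregate(records, freq):
    """Sum net generation per plant and period; freq=None leaves records as-is."""
    if freq is None:
        return [dict(r) for r in records]
    # "AS" groups on the year (YYYY), "MS" on the year and month (YYYY-MM).
    width = {"AS": 4, "MS": 7}[freq]
    totals = defaultdict(float)
    for r in records:
        totals[(r["plant_id"], r["report_date"][:width])] += r["net_generation_mwh"]
    return [
        {"plant_id": pid, "report_date": period, "net_generation_mwh": mwh}
        for (pid, period), mwh in sorted(totals.items())
    ]


def denormalize(records, names):
    """Attach plant names; identical for raw, monthly, or annual inputs."""
    return [{**r, "plant_name": names[r["plant_id"]]} for r in records]


yearly = denormalize(aggregate(generation, "AS"), plant_names)
monthly = denormalize(aggregate(generation, "MS"), plant_names)
raw = denormalize(aggregate(generation, None), plant_names)
```

Only `aggregate` knows about the frequency, which is the point of handing its output to a separate denormalization step.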

PudlTabl integration considerations:

PudlTabl.freq will control what table is pulled from the DB.
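
As an illustration of how a frequency setting could select the table to read, here is a small sketch that follows the `denorm_<table>[_as|_ms]_<source>` naming used in the list above. The `table_name_for_freq` helper is hypothetical and is not part of `PudlTabl`.

```python
from typing import Optional

# Map a frequency to the infix used in the table names listed above.
FREQ_INFIX = {None: "", "AS": "_as", "MS": "_ms"}


def table_name_for_freq(base: str, source: str, freq: Optional[str] = None) -> str:
    """Return the name of the denormalized table matching the given frequency."""
    if freq not in FREQ_INFIX:
        raise ValueError(f"freq must be None, 'AS', or 'MS', got {freq!r}")
    return f"denorm_{base}{FREQ_INFIX[freq]}_{source}"


unaggregated = table_name_for_freq("generation", "eia923")
monthly_name = table_name_for_freq("generation", "eia923", "MS")
yearly_name = table_name_for_freq("boiler_fuel", "eia923", "AS")
```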

zaneselvans commented 1 year ago

We have lots of tables that need to get aggregated at either annual or monthly granularity. These tables are often related to each other by dependencies. They form a sub-graph of the larger DAG. If we want to output two sets of the same tables, with one aggregated monthly, and the other annually, the exact same graph ops should be able to do both, getting run twice (with freq="MS" and freq="AS").

It seems like each of the sets of frequency specific outputs could be represented as a graph-backed multi-asset. So maybe what we need here is a graph-backed multi-asset factory, that takes freq as a parameter?

bendnorman commented 1 year ago

I think a graph backed multi asset factory is a good idea. What are some examples of these sub-graphs? How complex are they?

zaneselvans commented 1 year ago

The main sub-graph I'm working on right now is the several different generation_fuel variants that are used in the net generation allocation process, nearly all of which can run at monthly or yearly frequency. These include all the EIA-923 tables except fuel_receipts_costs_eia923!

generation_eia923
boiler_fuel_eia923
generation_fuel_eia923 (non-nuke)
generation_fuel_nuclear_eia923
generation_fuel_all_eia923 (combined nuke + non-nuke)
gen_fuel_by_generator_eia923
gen_fuel_by_generator_energy_source_eia923
gen_fuel_by_generator_energy_source_owner_eia923

Plus the annual-only generators_eia860 and boiler_generator_assn_eia860 for good measure!

bendnorman commented 1 year ago

Instead of creating a graph-backed multi-asset, we might be able to create a factory that returns multiple assets that represent one of these sub-graphs. Something like this:

from dagster import asset, Definitions, AssetIn
import pandas as pd

def asset_graph_factory(freq):
    @asset(name=f"a_{freq}")
    def a():
        # aggregate based on freq
        return pd.DataFrame()

    @asset(name=f"b_{freq}", ins={f"a_{freq}": AssetIn()})
    def b(**ins):
        # aggregate based on freq
        return pd.DataFrame()

    @asset(name=f"c_{freq}", ins={f"b_{freq}": AssetIn()})
    def c(**ins):
        # aggregate based on freq
        return pd.DataFrame()

    return (a, b, c)

defs = Definitions(
    assets=[
        *asset_graph_factory("ms"),
        *asset_graph_factory("as"),
        *asset_graph_factory("all"),
    ],
)

zaneselvans commented 1 year ago

Notes from design discussion with @bendnorman @cmgosnell @jdangerx:

We only need the combined nuke + non-nuke generation_fuel table to be aggregated & denormalized. The normalized inputs can chill in the DB as OG data but don't need to run through all these complications.
@bendnorman's suggestion of a factory that creates several different assets with dependencies among them seems simpler than using a graph-backed multi-asset, since it allows us to continue using only the software-defined asset abstraction without getting into ops and graphs.
In general we'd prefer to end up with a smaller number of wide and legible output tables, rather than a proliferation of intermediate outputs, so for the moment we might want to hold off on denormalizing any intermediate tables that we don't intend to be public facing / re-used.

zaneselvans commented 1 year ago

@jdangerx I’m trying to implement the asset factory pattern that @bendnorman suggested but I think I am misunderstanding something about how to return several assets simultaneously from the factory function…

If I do:

def generation_fuel_agg_eia923_asset_factory(
    freq: Literal["AS", "MS"],
    io_manager_key: str | None = None,
) -> tuple:
    ...

generation_fuel_agg_eia923_assets = [
    generation_fuel_agg_eia923_asset_factory(freq=freq) for freq in AGG_FREQS
]

then the assets aren't picked up. But I also can't unpack the returned tuple with the * operator like Ben did in his example (It's a syntax error):

  File "/Users/zane/code/catalyst/pudl/src/pudl/output/new_eia923.py", line 130
    generation_fuel_monthly_eia923_assets = *generation_fuel_agg_eia923_asset_factory(

SyntaxError: can't use starred expression here
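
For context on the error: a starred expression cannot be the entire right-hand side of an assignment. It has to sit inside a list, tuple, set, or call, which is why the `*` works inside the `assets=[...]` list in the earlier example. A small sketch, with a stand-in `fake_factory` that returns strings instead of Dagster assets:

```python
def fake_factory(freq):
    """Stand-in for an asset factory; returns a tuple of placeholder names."""
    return (f"a_{freq}", f"b_{freq}")


# assets = *fake_factory("MS")   # SyntaxError: can't use starred expression here

# Unpacking is fine once there is an enclosing container to unpack into.
assets_list = [*fake_factory("MS")]
assets_tuple = (*fake_factory("MS"),)  # note the trailing comma
combined = [*fake_factory("MS"), *fake_factory("AS")]
```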

zaneselvans commented 1 year ago

Hmm. If I have it return a list instead of a tuple and then call it once for each frequency, rather than trying to do a list comprehension, it seems to work:

generation_fuel_monthly_eia923_assets = generation_fuel_agg_eia923_asset_factory(
    freq="MS", io_manager_key=None
)
generation_fuel_yearly_eia923_assets = generation_fuel_agg_eia923_asset_factory(
    freq="AS", io_manager_key=None
)

Not sure why I can't just do:

generation_fuel_agg_eia923_assets = [
    ass for ass in
    generation_fuel_agg_eia923_asset_factory(freq=freq, io_manager_key=None)
    for freq in AGG_FREQS
]

bendnorman commented 1 year ago

Can you do:

generation_fuel_agg_eia923_assets = [
    ass for freq in AGG_FREQS
    for ass in generation_fuel_agg_eia923_asset_factory(freq=freq, io_manager_key=None)
]

to flatten the list of assets?
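
The clause order is what matters: in a nested comprehension the `for` clauses are evaluated left to right like nested loops, so `freq` has to be bound by the outer loop before the factory call that uses it. With the clauses reversed, `freq` is not yet defined when the factory is called. A sketch with a stand-in `fake_factory` and made-up asset names, also showing `itertools.chain.from_iterable` as an equivalent way to flatten:

```python
from itertools import chain

AGG_FREQS = ["AS", "MS"]


def fake_factory(freq):
    """Stand-in for the asset factory; returns a list of placeholder names."""
    return [f"gen_fuel_{freq}", f"gen_{freq}"]


# Outer loop first (freq), inner loop second (the assets for that freq).
flat = [ass for freq in AGG_FREQS for ass in fake_factory(freq)]

# Same result without a nested comprehension.
flat_chain = list(chain.from_iterable(fake_factory(freq) for freq in AGG_FREQS))
```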

e-belfer commented 1 year ago

Closed in #2519
