Closed zaneselvans closed 1 year ago
Should we have separate denormalized versions of the gen fuel and nuke gen fuel tables? Or should they just be available combined?
This issue is quite entangled with #2433
@bendnorman It seems like pandas and SQLAlchemy don’t treat database views quite like tables. This is a problem since we rely on uniform table-like behavior in many places.
For example:
pd.read_sql("denorm_plants_utilities_ferc1", pudl_engine) # fails.
pd.read_sql_table("denorm_plants_utilities_ferc1", pudl_engine) # fails.
pd.read_sql_query("SELECT * FROM denorm_plants_utilities_ferc1", pudl_engine) # succeeds.
Also:
md = sa.MetaData()
md.reflect(pudl_engine)
"denorm_plants_utilities_ferc1" in sorted(md.tables.keys()) # False
I wonder if this would all work if we did CREATE TABLE
rather than CREATE VIEW
?
Not sure what the right approach is. Maybe we shouldn't be trying to do any of this in SQL?
denorm_generation_eia923
denorm_generation_as_eia923
denorm_generation_ms_eia923
denorm_fuel_receipts_costs_eia923
denorm_fuel_receipts_costs_as_eia923
denorm_fuel_receipts_costs_ms_eia923
denorm_boiler_fuel_eia923
denorm_boiler_fuel_as_eia923
denorm_boiler_fuel_ms_eia923
denorm_generation_fuel_nuclear_eia923
denorm_generation_fuel_nonuclear_eia923
denorm_generation_fuel_all_eia923
denorm_generation_fuel_nuclear_as_eia923
denorm_generation_fuel_nonuclear_as_eia923
denorm_generation_fuel_all_as_eia923
denorm_generation_fuel_nuclear_ms_eia923
denorm_generation_fuel_nonuclear_ms_eia923
denorm_generation_fuel_all_ms_eia923
generation_fuel_all_eia923
table first, and then provide the 3 denormalized frequency versions (with freq=None
, AS
, MS
).There are several aggregated & allocated versions of generation fuel data:
gen_fuel_by_generator_eia923
gen_fuel_by_generator_energy_source_eia923
gen_fuel_by_generator_energy_source_owner_eia923
PudlTabl
object) so this isn't too complex.We have lots of tables that need to get aggregated at either annual or monthly granularity. These tables are often related to each other by dependencies. They form a sub-graph of the larger DAG. If we want to output two sets of the same tables, with one aggregated monthly, and the other annually, the exact same graph ops should be able to do both, getting run twice (with freq="MS"
and freq="AS"
).
It seems like each of the sets of frequency specific outputs could be represented as a graph-backed multi-asset. So maybe what we need here is a graph-backed multi-asset factory, that takes freq
as a parameter?
I think a graph backed multi asset factory is a good idea. What are some examples of these sub-graphs? How complex are they?
The main sub-graph I'm working on right now is the several different generation_fuel
variants that are used in the net generation allocation process, nearly all of which can run at monthly or yearly frequency. These include all the EIA-923 tables except fuel_receipts_costs_eia923
!
generation_eia923
boiler_fuel_eia923
generation_fuel_eia923
(non-nuke)generation_fuel_nuclear_eia923
generation_fuel_all_eia923
(combined nuke + non-nuke)gen_fuel_by_generator_eia923
gen_fuel_by_generator_energy_source_eia923
gen_fuel_by_generator_energy_source_owner_eia923
Plus the annual-only generators_eia860
and boiler_generator_assn_eia860
for good measure!
Instead of creating a graph-backed multi-asset, we might be able to create a factory that returns multiple assets that represent one of these sub-graphs. Something like this:
from dagster import asset, Definitions, AssetIn
import pandas as pd
def asset_graph_factory(freq):
@asset(name=f"a_{freq}")
def a():
# aggregate based on freq
return pd.DataFrame()
@asset(name=f"b_{freq}", ins={f"a_{freq}": AssetIn()})
def b(**ins):
# aggregate based on freq
return pd.DataFrame()
@asset(name=f"c_{freq}", ins={f"b_{freq}": AssetIn()})
def c(**ins):
# aggregate based on freq
return pd.DataFrame()
return (a, b, c)
defs = Definitions(
assets=[
*asset_graph_factory("ms"),
*asset_graph_factory("as"),
*asset_graph_factory("all"),
],
)
Notes from design discussion with @bendnorman @cmgosnell @jdangerx:
generation_fuel
table to be aggregated & denormalized. The normalized inputs can chill in the DB as OG data but don't need to run through all these complications.@jdangerx I’m trying to implement the asset factory pattern that @bendnorman suggested but I think I am misunderstanding something about how to return several assets simultaneously from the factory function…
If I do:
def generation_fuel_agg_eia923_asset_factory(
freq: Literal["AS", "MS"],
io_manager_key: str | None = None,
) -> tuple:
...
generation_fuel_agg_eia923_assets = [
generation_fuel_agg_eia923_asset_factory(freq=freq) for freq in AGG_FREQS
]
then the assets aren't picked up. But I also can't unpack the returned tuple
with the *
operator like Ben did in his example (It's a syntax error):
File "/Users/zane/code/catalyst/pudl/src/pudl/output/new_eia923.py", line 130
generation_fuel_monthly_eia923_assets = *generation_fuel_agg_eia923_asset_factory(
SyntaxError: can't use starred expression here
Hmm. If I have it return a list instead of a tuple and then call each frequency separately, rather than trying to do a list comprehension it seems to work:
generation_fuel_monthly_eia923_assets = generation_fuel_agg_eia923_asset_factory(
freq="MS", io_manager_key=None
)
generation_fuel_yearly_eia923_assets = generation_fuel_agg_eia923_asset_factory(
freq="AS", io_manager_key=None
)
Not sure why I can't just do:
generation_fuel_agg_eia923_assets = [
ass for ass in
generation_fuel_agg_eia923_asset_factory(freq=freq, io_manager_key=None)
for freq in AGG_FREQS
]
Can you do:
generation_fuel_agg_eia923_assets = [
ass for freq in AGG_FREQS
for ass in generation_fuel_agg_eia923_asset_factory(freq=freq, io_manager_key=None)
]
to flatten the list of assets?
Closed in #2519
For each of the tables below:
PudlTabl
output object to refer to the database rather than doing its own calculations.This issue should replace all of the code in
pudl.output.eia923
andpudl.output.eia860
✅
boiler_fuel_eia923
✅
fuel_receipts_costs_eia923
denorm_fuel_receipts_costs_eia923
has all ofstate
,mine_state
andplant_state
, and different numbers of all of them, which seems wrong.denorm_fuel_receipts_costs_eia923
has a bunch of extra_rollfilled
columns that should have been dropped by the rolling average function... and these show up in both the existing and dagster outputs! Actually these columns never should have been added. The one data column (fuel_cost_per_mmbtu
) that's actually getting filled is fine -- and its_rollfilled
doppelganger is dropped correctly. It's all these extra columns that are hanging around.✅
generation_eia923
✅
generation_fuel_eia923
pudl.output.eia923.generation_fuel_all_eia923()
This is duplicative and we should dedupe it, but just trying to make it work for the moment.The more complex allocated and aggregated
generation_fuel_eia923
assets are part of #2435