zaneselvans commented 1 year ago

- [x] Create `heat_rate_by_unit` asset for both `AS` and `MS` frequency
- [x] Create `heat_rate_by_generator` asset for both `AS` and `MS` frequency
- [x] Create `fuel_cost_by_generator` asset for both `AS` and `MS` frequency
- [x] Create `capacity_factor_by_generator` asset for both `AS` and `MS` frequency
- [x] Create `mcoe` asset for both `AS` and `MS` frequency
- [x] Check whether the MCOE outputs look reasonable
- [x] Add `mcoe` to DB
- [x] Add `heat_rate_by_unit` to DB
- [x] Add `heat_rate_by_gen` to DB
- [x] Add `capacity_factor` to DB
- [x] Add `fuel_cost` to DB
- [x] Update `PudlTabl` to read all MCOE outputs from the DB
- [x] Run integration & validation tests on the MCOE outputs
- [x] Decide if/how to filter MCOE outputs for valid capacity factor, heat rate, and fuel prices
- [x] Decide which of the pre-MCOE tables should actually live in the DB (hr, cf, fc)
- [x] Decide which wide / filled MCOE tables are required with @cmgosnell

zaneselvans commented 1 year ago

Notes from Sanity Checks:

Extremely low (11.5%) coverage of `unit_id_pudl` and partial (~70%) coverage of fuel prices mean that only a small fraction of all heat rates (10.1%) and fuel costs (6.5%) can be estimated at all. Has it always been this bad?

In the MCOE output we have far better coverage of `net_generation_mwh` than of `total_mmbtu`, which I think is the equivalent value on the fuel side. Now that we have allocated estimates for both net generation and fuel consumption, should we be using those estimates in tandem in this analysis to estimate heat rates by generator-month? What's the difference between these two estimates of total fuel consumption at the generator level?

data_cols = [
    "unit_id_pudl",
    "capacity_factor",
    "fuel_cost_per_mmbtu",
    "fuel_cost_per_mwh",
    "heat_rate_mmbtu_mwh",
    "net_generation_mwh",
    "total_fuel_cost",
    "total_mmbtu",
]

dude = (
    mcoe_monthly[["report_date"] + data_cols]
    .groupby("report_date")
    .apply(lambda x: x.isna().sum() / len(x))
)
for col in data_cols:
    plt.plot(dude[col], label=col)
plt.ylabel("Missingness")
plt.legend()

Using the current allocations of fuel consumption and net generation instead, we get much better coverage. And we're already using the allocated net generation side of things, so why would we not also want to use the per-generator fuel allocations? Do we not trust them as much?

gf_by_gen_monthly = defs.load_asset_value(AssetKey("generation_fuel_by_generator_monthly_eia923"))
gf_by_gen_monthly["heat_rate_mmbtu_mwh"] = gf_by_gen_monthly.fuel_consumed_for_electricity_mmbtu / gf_by_gen_monthly.net_generation_mwh

data_cols = [
    "unit_id_pudl",
    "heat_rate_mmbtu_mwh",
    "net_generation_mwh",
    "fuel_consumed_for_electricity_mmbtu",
]

dude = (
    gf_by_gen_monthly[["report_date"] + data_cols]
    .groupby("report_date")
    .apply(lambda x: x.isna().sum() / len(x))
)
for col in data_cols:
    plt.plot(dude[col], label=col)
plt.ylabel("Missingness")
plt.legend()
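
One caveat when deriving heat rates by dividing the two allocated columns: in pandas, a positive fuel value divided by zero net generation yields `inf` rather than NA, so those rows would not show up as missing in the plots above. The following is a minimal sketch on made-up data (the `toy` dataframe and its values are hypothetical, not PUDL output) that converts infinite heat rates to NA and then computes the per-date missingness fraction:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the allocated generator-month table.
toy = pd.DataFrame(
    {
        "report_date": pd.to_datetime(
            ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"]
        ),
        "net_generation_mwh": [100.0, 0.0, 50.0, np.nan],
        "fuel_consumed_for_electricity_mmbtu": [1000.0, 5.0, 450.0, 300.0],
    }
)

# Heat rate = fuel consumed / net generation. Division by zero gives inf,
# which we treat as missing here (an assumption about the desired behavior).
toy["heat_rate_mmbtu_mwh"] = (
    toy["fuel_consumed_for_electricity_mmbtu"] / toy["net_generation_mwh"]
).replace([np.inf, -np.inf], np.nan)

value_cols = [
    "net_generation_mwh",
    "fuel_consumed_for_electricity_mmbtu",
    "heat_rate_mmbtu_mwh",
]
# Fraction of missing values per column for each report date.
missingness = toy[value_cols].isna().groupby(toy["report_date"]).mean()
```

Whether zero-generation months should count as missing heat rates or be handled some other way is a judgment call that this sketch does not settle.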

Given that we're calculating all these metrics by generator and timestep and then joining them all together, it seems like maybe we should just load the final MCOE table into the database, rather than all of the intermediary assets like capacity_factor and fuel_cost and hr_by_gen. Are there places outside of the MCOE calculation that these intermediary values are being accessed / calculated directly? I guess we'll find out.

zaneselvans commented 1 year ago

For the FERC to EIA analysis:

- Needs all of the columns (though not all of them actually go into Plant Parts / Gens Mega).
- Needs all of the generators.
- Only uses the annual data (because FERC is only annual).
- Does not use the time series filling (which should also be obviated by the monthly per-plant EIA fuel price estimates).

Where to go from here after chatting with Christina:

- Keep the MCOE table as skinny / simple as possible.
- Create a generators datamart table, using the last bits of the MCOE function.
- Redirect PPE / MegaGens to point at the generators datamart table.
- Do not migrate time series filling, which is False by default and never set to True?
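
As a rough illustration of the generators datamart idea, keeping every generator while leaving the MCOE table skinny could be done with a left merge of the generators table onto the MCOE table. This is only a sketch under assumptions: the dataframes are toy data, and the key columns (`plant_id_eia`, `generator_id`, `report_date`) and `capacity_mw` are assumed here rather than taken from the actual implementation.

```python
import pandas as pd

# Hypothetical generators table: one row per generator and report date.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "generator_id": ["a", "b", "a"],
        "report_date": pd.to_datetime(["2020-01-01"] * 3),
        "capacity_mw": [100.0, 50.0, 20.0],
    }
)

# Hypothetical skinny MCOE table: only generators with usable estimates.
mcoe = pd.DataFrame(
    {
        "plant_id_eia": [1],
        "generator_id": ["a"],
        "report_date": pd.to_datetime(["2020-01-01"]),
        "heat_rate_mmbtu_mwh": [9.5],
    }
)

# Assumed key columns for matching generator records.
key_cols = ["plant_id_eia", "generator_id", "report_date"]

# A left merge keeps all generators; MCOE columns are NA where no estimate
# exists. validate= raises if either side has duplicate keys.
gens_mart = gens.merge(mcoe, on=key_cols, how="left", validate="one_to_one")
```

With this shape, generators that lack MCOE estimates are still present for the FERC to EIA matching, just with NA in the MCOE-derived columns.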

zaneselvans commented 1 year ago

After running the data validation and full integration tests:

- I got errors due to calls to the MCOE outputs that passed min/max filter values (unsurprisingly).
- I added those arguments back into the `mcoe` output method, and set the offending values to NA, as we were previously doing in the MCOE function itself.
- This doesn't address the `all_gens` argument, which doesn't do anything, so the `plant_parts_eia` tests should still fail (or behave strangely).
- The vanilla MCOE table needs to have extraneous columns stripped from it.
- The `gens_cols` and `all_gens` functionality that is needed for the FERC to EIA / Plant Parts work downstream needs to be implemented downstream of MCOE.
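
For reference, setting out-of-range values to NA (rather than dropping the rows) might look something like the sketch below. The helper name `mask_out_of_range`, its parameters, and the threshold values are all hypothetical and are not the actual `PudlTabl` arguments.

```python
import pandas as pd


def mask_out_of_range(df, col, min_value=None, max_value=None):
    """Return a copy of df with values of col outside the bounds set to NA.

    Hypothetical helper for illustration only. Rows are kept; only the
    offending values are nulled. A bound of None is not applied.
    """
    out = df.copy()
    keep = pd.Series(True, index=out.index)
    if min_value is not None:
        keep &= out[col] >= min_value
    if max_value is not None:
        keep &= out[col] <= max_value
    # Series.where keeps values where the mask is True and inserts NA elsewhere.
    out[col] = out[col].where(keep)
    return out


# Toy data with made-up heat rates; the bounds below are arbitrary examples.
mcoe_example = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 3],
        "generator_id": ["a", "b", "a", "a"],
        "heat_rate_mmbtu_mwh": [9.5, 2.0, 11.0, 40.0],
    }
)
filtered = mask_out_of_range(
    mcoe_example, "heat_rate_mmbtu_mwh", min_value=5.0, max_value=15.0
)
```

Nulling values instead of filtering rows keeps the table's row count stable, which matters if downstream steps expect every generator to be present.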

Convert MCOE analysis to Dagster assets #2438
