catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Convert MCOE analysis to Dagster assets #2438

Closed zaneselvans closed 10 months ago

zaneselvans commented 1 year ago
- [x] Create `heat_rate_by_unit` asset for both `AS` and `MS` frequency
- [x] Create `heat_rate_by_generator` asset for both `AS` and `MS` frequency
- [x] Create `fuel_cost_by_generator` asset for both `AS` and `MS` frequency
- [x] Create `capacity_factor_by_generator` asset for both `AS` and `MS` frequency
- [x] Create `mcoe` asset for both `AS` and `MS` frequency
- [x] Check whether the MCOE outputs look reasonable
- [x] Add `mcoe` to DB
- [x] Add `heat_rate_by_unit` to DB
- [x] Add `heat_rate_by_gen` to DB
- [x] Add `capacity_factor` to DB
- [x] Add `fuel_cost` to DB
- [x] Update `PudlTabl` to read all MCOE outputs from the DB
- [x] Run integration & validation tests on the MCOE outputs
- [x] Decide if/how to filter MCOE outputs for valid capacity factor, heat rate, and fuel prices
- [x] Decide which of the pre-MCOE tables should actually live in the DB (hr, cf, fc)
- [x] Decide which wide / filled MCOE tables are required with @cmgosnell
zaneselvans commented 1 year ago

Notes from Sanity Checks:

Extremely low (11.5%) coverage of `unit_id_pudl` and partial (~70%) coverage of fuel prices mean that only a small fraction of all heat rates (10.1%) and fuel costs (6.5%) can be estimated at all. Has it always been this bad?
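To illustrate why coverage collapses like this: a derived value is only non-null where every one of its inputs is non-null, so missingness compounds down the chain. Here is a minimal sketch on made-up toy data. The rows and values are hypothetical, not PUDL output, and it assumes a heat rate is only estimated for records that have a `unit_id_pudl`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; the values are illustrative only.
toy = pd.DataFrame(
    {
        "unit_id_pudl": [1, np.nan, np.nan, 2, np.nan],
        "total_mmbtu": [100.0, 80.0, 60.0, 90.0, 70.0],
        "net_generation_mwh": [10.0, 8.0, 6.0, 9.0, 7.0],
        "fuel_cost_per_mmbtu": [3.0, 2.5, np.nan, np.nan, 4.0],
    }
)

# Assumption for this sketch: no unit ID means no heat rate estimate.
has_unit = toy["unit_id_pudl"].notna()
toy["heat_rate_mmbtu_mwh"] = (
    toy["total_mmbtu"] / toy["net_generation_mwh"]
).where(has_unit)

# Fuel cost per MWh needs both a heat rate and a fuel price, so it is
# null wherever either input is null.
toy["fuel_cost_per_mwh"] = toy["heat_rate_mmbtu_mwh"] * toy["fuel_cost_per_mmbtu"]

# Fraction of non-null values in each column.
coverage = toy.notna().mean()
print(coverage)
```

In the toy data the fuel cost coverage ends up below both the unit ID coverage and the fuel price coverage, which is the same pattern as the real numbers above.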

In the MCOE output we have far better coverage of `net_generation_mwh` than of `total_mmbtu`, which I think is the equivalent value on the fuel side. Now that we have allocated estimates for both net generation and fuel consumption, should we be using those estimates in tandem in this analysis to estimate heat rates by generator-month? What's the difference between these two estimates of total fuel consumption at the generator level?

```python
import matplotlib.pyplot as plt

data_cols = [
    "unit_id_pudl",
    "capacity_factor",
    "fuel_cost_per_mmbtu",
    "fuel_cost_per_mwh",
    "heat_rate_mmbtu_mwh",
    "net_generation_mwh",
    "total_fuel_cost",
    "total_mmbtu",
]

# Fraction of missing values in each column, by report date.
# `mcoe_monthly` is the monthly MCOE output dataframe.
missingness = (
    mcoe_monthly[data_cols]
    .isna()
    .groupby(mcoe_monthly["report_date"])
    .mean()
)
for col in data_cols:
    plt.plot(missingness[col], label=col)
plt.ylabel("Missingness")
plt.legend()
```

[Plot: fraction of missing values in each MCOE column, by report date]

Using the current allocations of fuel consumption and net generation instead, we get much better coverage. And we're already using the allocated net generation side of things, so why would we not also want to use the per-generator fuel allocations? Do we not trust them as much?

```python
import matplotlib.pyplot as plt
from dagster import AssetKey

# `defs` is assumed to be PUDL's Dagster Definitions object, already in scope.
gf_by_gen_monthly = defs.load_asset_value(
    AssetKey("generation_fuel_by_generator_monthly_eia923")
)
# Heat rate from the allocated fuel consumption and net generation.
gf_by_gen_monthly["heat_rate_mmbtu_mwh"] = (
    gf_by_gen_monthly.fuel_consumed_for_electricity_mmbtu
    / gf_by_gen_monthly.net_generation_mwh
)

data_cols = [
    "unit_id_pudl",
    "heat_rate_mmbtu_mwh",
    "net_generation_mwh",
    "fuel_consumed_for_electricity_mmbtu",
]

# Fraction of missing values in each column, by report date.
missingness = (
    gf_by_gen_monthly[data_cols]
    .isna()
    .groupby(gf_by_gen_monthly["report_date"])
    .mean()
)
for col in data_cols:
    plt.plot(missingness[col], label=col)
plt.ylabel("Missingness")
plt.legend()
```

[Plot: fraction of missing values in each allocated generation/fuel column, by report date]

Given that we're calculating all these metrics by generator and timestep and then joining them all together, it seems like maybe we should just load the final MCOE table into the database, rather than all of the intermediary assets like capacity_factor and fuel_cost and hr_by_gen. Are there places outside of the MCOE calculation that these intermediary values are being accessed / calculated directly? I guess we'll find out.
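To make the "calculate by generator and timestep, then join" structure concrete, here is a rough sketch of assembling a single wide MCOE table from the intermediate outputs. The join keys (`plant_id_eia`, `generator_id`), the `assemble_mcoe` helper, and the toy values are assumptions for illustration, not the actual PUDL implementation:

```python
import pandas as pd

# Assumed join keys; the real tables may use a different primary key.
KEYS = ["report_date", "plant_id_eia", "generator_id"]


def assemble_mcoe(hr_by_gen, fuel_cost, capacity_factor):
    """Outer-join per-generator intermediate tables into one wide table.

    Hypothetical helper: validate="one_to_one" makes the merge fail loudly
    if any input has duplicate generator-timestep records.
    """
    out = hr_by_gen.merge(fuel_cost, on=KEYS, how="outer", validate="one_to_one")
    return out.merge(capacity_factor, on=KEYS, how="outer", validate="one_to_one")


date = pd.Timestamp("2020-01-01")
hr = pd.DataFrame(
    {"report_date": [date], "plant_id_eia": [1], "generator_id": ["A"],
     "heat_rate_mmbtu_mwh": [10.0]}
)
fc = pd.DataFrame(
    {"report_date": [date], "plant_id_eia": [1], "generator_id": ["A"],
     "fuel_cost_per_mwh": [30.0]}
)
cf = pd.DataFrame(
    {"report_date": [date, date], "plant_id_eia": [1, 1],
     "generator_id": ["A", "B"], "capacity_factor": [0.5, 0.4]}
)

mcoe = assemble_mcoe(hr, fc, cf)
print(mcoe)
```

With an outer join, generator B keeps its capacity factor but has null heat rate and fuel cost, so the wide table alone still shows which inputs were missing for each generator.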

zaneselvans commented 1 year ago

For the FERC to EIA analysis:

Where to go from here after chatting with Christina:

zaneselvans commented 1 year ago

After running the data validation and full integration tests: