catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Investigate subplant_id coverage/usability #2548

Closed cmgosnell closed 1 year ago

cmgosnell commented 1 year ago
- [x] are there any boilers that are not covered by the subplant_id?
  - [x] from the boilers_entity_eia table?
  - [x] from the boiler_fuel_eia923 table?
- [x] are there any subplant_ids that only partially encompass units reporting to the generation_fuel_eia923 table?
- [x] are there any subplant_ids that only partially encompass units reporting to the boiler_fuel_eia923 table?

cmgosnell commented 1 year ago

Below are my ongoing notes of the exploration. I will attach a little notebook.

Setup

# Import stuff and pull in tables we'll need
from dagster import AssetKey 
import pudl
from pudl.etl import defs

epacamd_eia_subplant = defs.load_asset_value(AssetKey("epacamd_eia_subplant_ids"))
epa_eia = defs.load_asset_value(AssetKey("epacamd_eia"))
gens_eia860 = defs.load_asset_value(AssetKey("generators_eia860"))
bga_eia860 = defs.load_asset_value(AssetKey("boiler_generator_assn_eia860"))
generators_entity_eia = defs.load_asset_value(AssetKey("generators_entity_eia"))
bf_eia923 = defs.load_asset_value(AssetKey("boiler_fuel_eia923"))
gf_eia923 = defs.load_asset_value(AssetKey("generation_fuel_eia923"))
boilers_eia = defs.load_asset_value(AssetKey("boilers_entity_eia"))

Coverage of subplant_id

Boiler Coverage in BGA

I first had to check whether there are boilers that don't show up in the boiler-generator association (BGA) table.

Image

So 11% of the boilers don't show up in the BGA table.
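The check behind that number can be sketched as an anti-join. This is a minimal illustration on made-up rows, not the notebook's actual code; the key columns `plant_id_eia` and `boiler_id` are assumed to identify a boiler in both tables.

```python
import pandas as pd

# Toy stand-ins for boilers_entity_eia and boiler_generator_assn_eia860.
# All values are made up; the column names are assumptions.
boilers = pd.DataFrame(
    {"plant_id_eia": [1, 1, 2, 3], "boiler_id": ["B1", "B2", "B1", "B1"]}
)
bga = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "boiler_id": ["B1", "B2", "B1"],
        "generator_id": ["G1", "G2", "G1"],
    }
)

keys = ["plant_id_eia", "boiler_id"]
# Left-merge against the distinct BGA boiler keys. The "_merge" column
# records whether each boiler was found in the BGA.
merged = boilers.merge(
    bga[keys].drop_duplicates(), on=keys, how="left", indicator=True
)
missing = merged[merged["_merge"] == "left_only"]
share_missing = len(missing) / len(boilers)
print(f"{share_missing:.0%} of boilers are missing from the BGA")
```

Deduplicating the BGA keys before the merge matters: a boiler linked to several generators would otherwise be counted more than once.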

BGA-boiler fuel table coverage

Q: How many of those boilers missing from bga show up in the eia923 "data" tables?

image

A: Enough to care about! These are mostly gas units.

Q: How much fuel is not represented in this BGA table?

image

A: Only 0.5%. That is a relatively small amount overall, but not nothing. It's also concentrated in a small number of plants, so for those plants any downstream error would be large.

image
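A rough sketch of how the overall and per-plant fuel shares could be computed. The rows are made up for illustration and are not real EIA data; `fuel_consumed_mmbtu` is an assumed column name for the boiler fuel table.

```python
import pandas as pd

# Toy stand-in for boiler_fuel_eia923 (made-up values, assumed column names).
bf = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 3],
        "boiler_id": ["B1", "B2", "B1", "B1"],
        "fuel_consumed_mmbtu": [600.0, 300.0, 95.0, 5.0],
    }
)
# Boilers that do appear in the BGA table (made up).
bga_boilers = pd.DataFrame(
    {"plant_id_eia": [1, 1, 2], "boiler_id": ["B1", "B2", "B1"]}
)

keys = ["plant_id_eia", "boiler_id"]
flagged = bf.merge(
    bga_boilers.drop_duplicates(), on=keys, how="left", indicator=True
)
flagged["in_bga"] = flagged["_merge"] == "both"

# Overall share of fuel burned by boilers that the BGA does not cover.
unmapped = flagged.loc[~flagged["in_bga"]]
total_fuel = flagged["fuel_consumed_mmbtu"].sum()
fuel_share_unmapped = unmapped["fuel_consumed_mmbtu"].sum() / total_fuel

# Per-plant view: the share of each plant's fuel that is unmapped.
plant_share_unmapped = (
    unmapped.groupby("plant_id_eia")["fuel_consumed_mmbtu"].sum()
    .div(flagged.groupby("plant_id_eia")["fuel_consumed_mmbtu"].sum())
    .fillna(0.0)
)
```

In this toy data the unmapped share is tiny overall, but it is the entire fuel use of one plant, which is the concentration problem described above.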

Q: Are any of these boiler records able to be associated with generators? Do they show up anywhere with gens in other tables?

BGA records missing from subplant id

Q: Are there boilers in the BGA that are missing from the subplant_id? Presumably not, because this table was an input, but let's double check.

image

A: Nope!

Allocated GF/BF table coverage

image

With proper prep (WHICH IS A LOT), the subplant_id coverage for the allocated net generation and fuel is complete.

image

The generator-level version of this is much easier to manage because we don't need to clean or ensure the coverage and shape of the energy_source_code.

Partial coverage of subplant_id's in generation_fuel_eia923 table

image

I believe this shows that there are no partially reporting subplants in the gf table!

image

This tells the same story, but I trust this outcome more because we don't have complete boiler coverage.
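For reference, a partial-coverage check of this kind can be sketched as follows. It assumes a generator-to-subplant mapping and a list of generators that report; how that list is built from the EIA-923 tables is deliberately left out, and all rows are made up.

```python
import pandas as pd

# Toy generator-to-subplant mapping (hypothetical values).
subplants = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2, 2],
        "generator_id": ["G1", "G2", "G3", "G1", "G2"],
        "subplant_id": [0, 0, 1, 0, 0],
    }
)
# Generators that appear in some reporting table (hypothetical).
reporting = pd.DataFrame(
    {"plant_id_eia": [1, 1, 2], "generator_id": ["G1", "G2", "G1"]}
)

keys = ["plant_id_eia", "generator_id"]
flagged = subplants.merge(
    reporting.drop_duplicates(), on=keys, how="left", indicator=True
)
flagged["reports"] = flagged["_merge"] == "both"

# Share of each subplant's generators that report. A subplant is only
# partially covered when that share is strictly between 0 and 1.
share = flagged.groupby(["plant_id_eia", "subplant_id"])["reports"].mean()
partial = share[(share > 0) & (share < 1)]
```

An empty `partial` would correspond to the "no partially reporting subplants" result; in the toy data, plant 2's only subplant is flagged because one of its two generators reports.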

cmgosnell commented 1 year ago

Okay @arengel, the tl;dr version of my messy musings above is:

Mixed bag. I have some ideas about how to explore the boilers that aren't currently covered under a subplant_id, but I'd love your thoughts before I move forward. Happy to discuss on a call if that is easier.

arengel commented 1 year ago

Hey @cmgosnell, I've been testing out the new epacamd_eia_subplant_ids table and have found 3 issues:

  1. plant_id_eia=4042: in this plant there is a camd_unit that maps to two different generators, one of which is missing a unit_id_pudl. As a result, that part of the camd_unit gets a different subplant_id. From what I am seeing for this plant, it seems like the whole plant should be one subplant.
  2. plant_id_eia=2708: in this plant there are two camd_units (2A and 2B) that are each associated with two generators (one in common). The non-common generators do not have unit_id_pudls, so these camd_units are associated with different subplants.
  3. plant_id_eia=55126: here there are two camd_units that are split between subplants, I think again because some of the generators do not have unit_id_pudls.
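One way to read these cases is as a connectivity problem: if every camd_unit is linked to its generators, anything reachable through shared links should land in the same subplant. The sketch below illustrates that expectation with a small union-find on made-up links shaped loosely like the 2A/2B case. It is only an illustration, not PUDL's actual subplant_id logic, and the generator names are hypothetical.

```python
from collections import defaultdict

# Hypothetical links: two CAMD units that share one generator ("gen:C").
edges = [
    ("camd:2A", "gen:A"),
    ("camd:2A", "gen:C"),
    ("camd:2B", "gen:C"),
    ("camd:2B", "gen:B"),
]

parent = {}


def find(node):
    """Return the representative of the group that contains node."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving
        node = parent[node]
    return node


def union(a, b):
    """Merge the groups that contain a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a


for unit, gen in edges:
    union(unit, gen)

# Collect the members of each connected group.
components = defaultdict(set)
for node in list(parent):
    components[find(node)].add(node)
groups = sorted(sorted(members) for members in components.values())
print(groups)
```

With the shared generator in place, both units and all three generators fall into a single group, which is the grouping one would expect the subplant_id to reflect.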

My process for finding these issues is as follows:

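import pandas as pd

# epacamd_eia_subplant_ids is the table of the same name, loaded as a DataFrame.
# Each CAMD unit within an EIA plant should map to exactly one subplant_id.
df = epacamd_eia_subplant_ids.groupby(
     ["plant_id_eia", "emissions_unit_id_epa"]
).agg({"subplant_id": pd.Series.nunique})
assert df[df.subplant_id > 1].empty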

I also test the following but this version does not find issues:

df = epacamd_eia_subplant_ids.groupby(
     ["plant_id_eia", "generator_id"]
).agg({"subplant_id": pd.Series.nunique})
assert df[df.subplant_id > 1].empty

Then there is another issue that shows up when you run the following test:

df = epacamd_eia_subplant_ids.groupby(
     ["plant_id_epa", "emissions_unit_id_epa"]
).agg({"subplant_id": pd.Series.nunique})
assert df[df.subplant_id > 1].empty

plant_id_epa=55375 is associated with both plant_id_eia 55375 and 57664. Both of these EIA plants have generators (and CAMD units) CT3 and CT4, which have different subplant_ids for the different EIA plants. It looks like all the units of plant_id_eia 55375 are proposed and have not been reported since 2010. I think this suggests that 57664 is the actual plant_id_eia of the facility and should be associated with plant_id_epa 55375, and that plant_id_eia 55375 should be dropped as part of the crosswalk.
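The three tests above share one shape, so they could be folded into a small helper that returns the offending groups instead of only asserting. This is a sketch: `find_split_groups` is a hypothetical name, and the toy rows are made up rather than taken from the real table.

```python
import pandas as pd


def find_split_groups(df, keys):
    """Return each key group that maps to more than one subplant_id."""
    n_subplants = df.groupby(keys)["subplant_id"].nunique()
    return (
        n_subplants[n_subplants > 1].rename("n_subplant_ids").reset_index()
    )


# Toy rows (made up): CAMD unit "2A" at plant 10 lands in two subplants.
toy = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 10, 20],
        "emissions_unit_id_epa": ["2A", "2A", "2B", "1"],
        "generator_id": ["G1", "G2", "G2", "G1"],
        "subplant_id": [0, 1, 1, 0],
    }
)

split_units = find_split_groups(toy, ["plant_id_eia", "emissions_unit_id_epa"])
split_gens = find_split_groups(toy, ["plant_id_eia", "generator_id"])
print(split_units)
```

As in the real tests, the unit-level check flags a split group here while the generator-level check comes back empty, since each generator still sits in exactly one subplant.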

cmgosnell commented 1 year ago

I pulled these weird groupings out into their own issue and am closing this one out for now.