cmgosnell commented 1 year ago

- [x] are there any boilers that are not covered by the subplant_id?
  - [x] from the boilers_entity_eia table?
  - [x] from the boiler_fuel_eia923 table?
- [x] are there any subplant_ids that only partially encompass units reporting to the generation_fuel_eia923 table?
- [x] are there any subplant_ids that only partially encompass units reporting to the boiler_fuel_eia923 table?

cmgosnell commented 1 year ago

Below are my ongoing notes of the exploration. I will attach a little notebook.

Setup

# Import stuff and pull in tables we'll need
from dagster import AssetKey
import pudl
from pudl.etl import defs

epacamd_eia_subplant = defs.load_asset_value(AssetKey("epacamd_eia_subplant_ids"))
epa_eia = defs.load_asset_value(AssetKey("epacamd_eia"))
gens_eia860 = defs.load_asset_value(AssetKey("generators_eia860"))
bga_eia860 = defs.load_asset_value(AssetKey("boiler_generator_assn_eia860"))
generators_entity_eia = defs.load_asset_value(AssetKey("generators_entity_eia"))
bf_eia923 = defs.load_asset_value(AssetKey("boiler_fuel_eia923"))
gf_eia923 = defs.load_asset_value(AssetKey("generation_fuel_eia923"))
boilers_eia = defs.load_asset_value(AssetKey("boilers_entity_eia"))

Coverage of subplant_id

Boiler Coverage in BGA

I first had to check whether there are boilers that don't show up in the boiler_generator_assn_eia860 (BGA) table.

It turns out that 11% of the boilers in boilers_entity_eia don't show up in the BGA table.
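For reference, a check like this can be sketched as an anti-join using the pandas merge indicator. This is only a minimal sketch on made-up toy frames standing in for boilers_eia and bga_eia860, and it assumes the boilers are keyed on plant_id_eia plus a boiler_id column, which these notes do not confirm.

```python
import pandas as pd

# Toy stand-ins for boilers_eia and bga_eia860 (all values are made up).
boilers = pd.DataFrame(
    {"plant_id_eia": [1, 1, 2, 3], "boiler_id": ["B1", "B2", "B1", "1"]}
)
bga = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "boiler_id": ["B1", "B1", "B1"],
        "generator_id": ["G1", "G2", "G1"],
    }
)

key = ["plant_id_eia", "boiler_id"]  # assumed boiler key columns

# Left-merge the unique boilers onto the unique BGA boilers; rows tagged
# "left_only" are boilers that never appear in the BGA.
merged = boilers.drop_duplicates(key).merge(
    bga[key].drop_duplicates(), on=key, how="left", indicator=True
)
missing_boilers = merged.loc[merged["_merge"] == "left_only", key]
share_missing = len(missing_boilers) / len(merged)
```

On the real tables, share_missing would be the number to compare against the 11% quoted above.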

Do any of these boilers missing from the BGA have shared attributes? Can we add the new boilers_entity_eia and boiler_eia860 tables into the BGA-making process?
Can we attempt a simple string matching of these missing boilers to gens in the BGA table that aren't currently mapped to boilers?
- The boilers from the boiler_fuel_eia923 table are inputs into the boiler_generator_assn_eia860 table. If the boiler ID string is the same as the generator ID string, they get associated, but we do not have a known process for mapping the 11% of boilers missing from the BGA.
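A simple string match along these lines could be sketched as below. The frames and values are made up, the boiler_id column is an assumption, and the normalize_id helper (uppercasing and dropping punctuation) is a hypothetical loosening of the exact-equality rule described above, not an existing PUDL function.

```python
import pandas as pd


def normalize_id(ids: pd.Series) -> pd.Series:
    """Uppercase and drop non-alphanumerics so 'ct-1' and 'CT1' compare equal."""
    return ids.astype(str).str.upper().str.replace(r"[^A-Z0-9]", "", regex=True)


# Toy data: boilers missing from the BGA, and BGA gens with no boiler mapped.
missing_boilers = pd.DataFrame(
    {"plant_id_eia": [1, 1, 2], "boiler_id": ["ct-1", "B2", "GT 3"]}
)
unmapped_gens = pd.DataFrame(
    {"plant_id_eia": [1, 2, 2], "generator_id": ["CT1", "GT3", "ST1"]}
)

# Candidate pairs: same plant and same normalized ID string.
candidates = missing_boilers.assign(
    id_norm=normalize_id(missing_boilers["boiler_id"])
).merge(
    unmapped_gens.assign(id_norm=normalize_id(unmapped_gens["generator_id"])),
    on=["plant_id_eia", "id_norm"],
    how="inner",
)
```

Any pairs found this way would only be candidates to review by hand, since a matching string does not prove a physical connection.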

BGA-boiler fuel table coverage

Q: How many of those boilers missing from bga show up in the eia923 "data" tables?

A: Enough to care about! These are mostly gas units.

Q: How much fuel is not represented in this BGA table? A: Only 0.5%. That is a relatively small amount overall, but not nothing. Plus, it's concentrated in a small number of plants, so for those plants any downstream error would be large.
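The fuel share and its concentration by plant could be computed roughly as follows. This is a sketch on toy numbers, assuming a fuel_consumed_mmbtu column in the boiler fuel table and a boiler_id key; neither column name appears in these notes.

```python
import pandas as pd

# Toy stand-in for bf_eia923 and for the unique boilers present in the BGA.
bf = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 3],
        "boiler_id": ["B1", "B2", "B1", "1"],
        "fuel_consumed_mmbtu": [600.0, 10.0, 380.0, 10.0],  # assumed column
    }
)
bga_boilers = pd.DataFrame({"plant_id_eia": [1, 2], "boiler_id": ["B1", "B1"]})

key = ["plant_id_eia", "boiler_id"]
flagged = bf.merge(bga_boilers.drop_duplicates(), on=key, how="left", indicator=True)
flagged["in_bga"] = flagged["_merge"] == "both"

# Overall share of fuel burned in boilers that are missing from the BGA.
total = flagged["fuel_consumed_mmbtu"].sum()
missing_fuel = flagged.loc[~flagged["in_bga"], "fuel_consumed_mmbtu"].sum()
share_missing_fuel = missing_fuel / total

# Per-plant share of missing fuel, to see how concentrated the gap is.
by_plant = (
    flagged.assign(missing=lambda d: d["fuel_consumed_mmbtu"].where(~d["in_bga"], 0.0))
    .groupby("plant_id_eia")[["missing", "fuel_consumed_mmbtu"]]
    .sum()
)
by_plant["share_missing"] = by_plant["missing"] / by_plant["fuel_consumed_mmbtu"]
```

A small overall share alongside a per-plant share near 1.0 is the pattern described above: a minor gap in aggregate, but a large one for the affected plants.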

Q: Are any of these boiler records able to be associated with generators? Do they show up anywhere with gens in other tables?

BGA records missing from subplant id

Q: Are there boilers in the BGA that are missing from the subplant_id? Presumably not, because this table was an input, but let's double check. A: Nope!

Allocated GF/BF table coverage

With proper prep (WHICH IS A LOT), the subplant_id coverage for the allocated net gen and fuel is complete.

The generator-level version of this is much easier to manage because we don't need to clean the energy_source_code or ensure its coverage and shape.

Partial coverage of subplant_id's in generation_fuel_eia923 table

I believe this does tell the story that there are no partially reporting subplants to the gf table!

This also tells the same story, but I trust this outcome more because we don't have complete boiler coverage.
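One way to test for partially reporting subplants is to count, per subplant, how many of its generators have reported data and flag the subplants where that count is neither zero nor all. The sketch below uses made-up frames, and the toy data deliberately includes one partial subplant to show what a hit looks like; the real data reportedly has none.

```python
import pandas as pd

# Toy stand-ins: generator-to-subplant mapping, and generators with reported data.
subplants = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2, 2],
        "generator_id": ["G1", "G2", "G3", "G1", "G2"],
        "subplant_id": [0, 0, 1, 0, 0],
    }
)
reporting = pd.DataFrame(
    {"plant_id_eia": [1, 1, 2], "generator_id": ["G1", "G2", "G1"]}
)

key = ["plant_id_eia", "generator_id"]
flagged = subplants.merge(
    reporting.drop_duplicates().assign(reports=True), on=key, how="left"
)
flagged["reports"] = flagged["reports"].notna()  # unmatched rows become False

# Count reporting vs. total generators in each subplant.
coverage = flagged.groupby(["plant_id_eia", "subplant_id"])["reports"].agg(
    n_reporting="sum", n_total="count"
)
# Partial: at least one generator reports, but not all of them.
partial = coverage[
    (coverage["n_reporting"] > 0) & (coverage["n_reporting"] < coverage["n_total"])
]
```

An empty partial frame on the real tables would correspond to the "whole subplants" conclusion.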

cmgosnell commented 1 year ago

okay @arengel the tl;dr version of my messy musing above is:

are there any boilers that are not covered by the subplant_id?
- from the boilers_entity_eia table?
  - 😢 yes. ~11% of boilers are not connected to a subplant_id. This is because they are not connected to a generator via the original bga table or our augmented boiler-generator association represented via the unit_id_pudl.
- from the boiler_fuel_eia923 table?
  - 😢 also yes. It's only 0.5% of total fuel, but it's in a small number of plants, so the effect is concentrated.
are there generation_fuel_eia923 records that are not covered by the subplant_id?
- 😃 it seems the allocated generation fuel data can actually all be associated with subplant_ids!
are there any subplant_ids that only partially encompass units reporting to the generation_fuel_eia923 table?
- 😄 believe it or not, it seems all the gf data is reported in whole subplants!
are there any subplant_ids that only partially encompass units reporting to the boiler_fuel_eia923 table?
- 😄 believe it or not, it seems all the bf data is reported in whole subplants!

Mixed bag. I have some ideas about how to explore the boilers that aren't currently covered under a subplant_id, but I'd love your thoughts before I move forward. Happy to discuss on a call if that is easier.

arengel commented 1 year ago

Hey @cmgosnell, I've been testing out the new epacamd_eia_subplant_ids table and have found 3 issues:

plant_id_eia=4042: in this plant there is a camd_unit that maps to two different generators, one of which is missing a unit_id_pudl. This leads to that part of the camd_unit getting a different subplant_id. From what I am seeing for this plant, it seems like the whole plant should be one subplant.
plant_id_eia=2708: in this plant there are two camd_units (2A and 2B) that are each associated with two generators (one in common). The non-common ones do not have unit_id_pudls, so these camd_units are associated with different subplants.
plant_id_eia=55126: here there are two camd_units that are split between subplants, I think again because some of the generators do not have unit_id_pudls.
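To illustrate why I would expect these units to end up together, here is a sketch that treats camd_unit-to-generator links as graph edges and groups everything that is connected. It is not necessarily how subplant_id is actually built: connected_groups is a hypothetical helper, and the IDs are made up to mimic the 2A/2B case above.

```python
from collections import defaultdict


def connected_groups(edges):
    """Group nodes joined by any chain of edges, using a small union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Made-up links: units 2A and 2B share generator G2; unit 3 stands alone.
edges = [
    (("camd", "2A"), ("gen", "G1")),
    (("camd", "2A"), ("gen", "G2")),
    (("camd", "2B"), ("gen", "G2")),
    (("camd", "2B"), ("gen", "G3")),
    (("camd", "3"), ("gen", "G4")),
]
groups = connected_groups(edges)
```

Here 2A, 2B, G1, G2 and G3 land in one group even though G1 and G3 carry no unit ID of their own, which is the behavior I would expect for these plants.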

My process for finding these issues is as follows:

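import pandas as pd

# epacamd_eia_subplant_ids is the table of the same name, loaded as a DataFrame.
# Each EPA emissions unit within an EIA plant should map to exactly one subplant_id.
df = epacamd_eia_subplant_ids.groupby(
    ["plant_id_eia", "emissions_unit_id_epa"]
).agg({"subplant_id": pd.Series.nunique})
assert df[df.subplant_id > 1].empty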

I also test the following but this version does not find issues:

df = epacamd_eia_subplant_ids.groupby(
    ["plant_id_eia", "generator_id"]
).agg({"subplant_id": pd.Series.nunique})
assert df[df.subplant_id > 1].empty

Then there is another issue that shows up when you run the following test:

df = epacamd_eia_subplant_ids.groupby(
    ["plant_id_epa", "emissions_unit_id_epa"]
).agg({"subplant_id": pd.Series.nunique})
assert df[df.subplant_id > 1].empty

plant_id_epa=55375 is associated with both plant_id_eia 55375 and 57664. Both of these EIA plants have generators (and CAMD units) CT3 and CT4, which have different subplant_ids for the different EIA plants. It looks like all the units of plant_id_eia 55375 are proposed and have not been reported since 2010, which I think suggests that 57664 is the actual plant_id_eia of the facility and should be associated with plant_id_epa 55375, and that plant_id_eia 55375 should be dropped from the crosswalk.
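A quick way to surface other cases like this is to count the distinct plant_id_eia values per plant_id_epa. The sketch below uses a toy crosswalk: the 55375/57664 rows mirror the case described above, and plant 3 is a made-up one-to-one control.

```python
import pandas as pd

# Toy crosswalk; only the columns needed for the check.
crosswalk = pd.DataFrame(
    {
        "plant_id_epa": [55375, 55375, 55375, 55375, 3],
        "plant_id_eia": [55375, 55375, 57664, 57664, 3],
        "emissions_unit_id_epa": ["CT3", "CT4", "CT3", "CT4", "1"],
    }
)

# Number of distinct EIA plants linked to each EPA plant.
n_eia = crosswalk.groupby("plant_id_epa")["plant_id_eia"].nunique()

# Keep every crosswalk row for EPA plants linked to more than one EIA plant.
multi_mapped = crosswalk[crosswalk["plant_id_epa"].isin(n_eia[n_eia > 1].index)]
```

Some one-to-many mappings between EPA and EIA plant IDs may be legitimate, so the rows in multi_mapped are a list to review rather than a list of errors.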

cmgosnell commented 1 year ago

I pulled these weird groupings out into their own issue and am closing this one out for now.

catalyst-cooperative / pudl

Investigate subplant_id coverage/usability #2548
