catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Remove unreasonable subplant_id groupings #2583

Open cmgosnell opened 1 year ago

cmgosnell commented 1 year ago

Fix the bad subplant id groupings that @arengel found.

Details below copied from this comment

Hey cmgosnell, I've been testing out the new epacamd_eia_subplant_ids table and have found 3 issues:

  1. plant_id_eia=4042: one camd_unit in this plant maps to two different generators, one of which is missing a unit_id_pudl, so that part of the camd_unit gets a different subplant_id. From what I am seeing for this plant, the whole plant should be one subplant.
  2. plant_id_eia=2708: two camd_units (2A and 2B) are each associated with two generators, one of which they have in common. The generators they do not share have no unit_id_pudl, so these camd_units are associated with different subplants.
  3. plant_id_eia=55126: two camd_units are split between subplants, I think again because some of the generators do not have a unit_id_pudl.
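
The three cases above share a pattern: a generator without a unit_id_pudl is not linked back to the camd_unit it shares with other generators. Below is a minimal sketch of the grouping one might expect instead, assuming a subplant should be a connected component of the camd_unit-to-generator links. That assumption is my reading of the cases above, not necessarily how PUDL assigns subplant_id. The function name and generator names are hypothetical; only the unit names 2A and 2B come from the plant 2708 example.

```python
def group_connected(links):
    """Give every unit and generator joined by (unit, generator) links a group number."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for unit, gen in links:
        # Tag each name so a unit and a generator with the same ID stay distinct.
        root_unit, root_gen = find(("unit", unit)), find(("gen", gen))
        parent[root_unit] = root_gen

    roots = {}
    return {node: roots.setdefault(find(node), len(roots)) for node in list(parent)}


# Hypothetical links shaped like plant_id_eia=2708: two units, one shared generator.
links = [
    ("2A", "GEN_SHARED"),
    ("2A", "GEN_A"),
    ("2B", "GEN_SHARED"),
    ("2B", "GEN_B"),
]
groups = group_connected(links)
same_subplant = groups[("unit", "2A")] == groups[("unit", "2B")]
print(same_subplant)
```

Under this assumption, units 2A and 2B land in the same group because they share a generator, whether or not the other generators carry a unit_id_pudl.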

My process for finding these issues is as follows:

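import pandas as pd

# Each (plant_id_eia, emissions_unit_id_epa) pair should map to a single subplant_id.
df = epacamd_eia_subplant_ids.groupby(
    ["plant_id_eia", "emissions_unit_id_epa"]
).agg({"subplant_id": pd.Series.nunique})
assert df[df.subplant_id > 1].empty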

I also test the following, but this version does not find any issues:

# Each (plant_id_eia, generator_id) pair should map to a single subplant_id.
df = epacamd_eia_subplant_ids.groupby(
    ["plant_id_eia", "generator_id"]
).agg({"subplant_id": pd.Series.nunique})
assert df[df.subplant_id > 1].empty
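
The checks above fail with a bare AssertionError, which does not say which plants are affected. A small helper along these lines could report the offending keys instead. This is only a sketch: the helper name find_split_groups and the toy table are hypothetical, and the toy values are invented to mimic the situation where the unit-level check finds a split but the generator-level check does not.

```python
import pandas as pd


def find_split_groups(df: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Return the key combinations that map to more than one subplant_id."""
    counts = (
        df.groupby(keys)["subplant_id"]
        .nunique()
        .reset_index(name="n_subplants")
    )
    return counts[counts["n_subplants"] > 1].reset_index(drop=True)


# Toy stand-in for epacamd_eia_subplant_ids; every value here is invented.
toy = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2],
        "emissions_unit_id_epa": ["A", "A", "B", "A"],
        "generator_id": ["G1", "G2", "G3", "G1"],
        "subplant_id": [0, 1, 1, 0],
    }
)

split_units = find_split_groups(toy, ["plant_id_eia", "emissions_unit_id_epa"])
split_gens = find_split_groups(toy, ["plant_id_eia", "generator_id"])
print(split_units)
```

Note that groupby drops rows whose key columns are null by default, so rows with a missing generator_id or emissions_unit_id_epa would not show up in either result.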

Then there is another issue that shows up when you run the following test:

# Each (plant_id_epa, emissions_unit_id_epa) pair should map to a single subplant_id.
df = epacamd_eia_subplant_ids.groupby(
    ["plant_id_epa", "emissions_unit_id_epa"]
).agg({"subplant_id": pd.Series.nunique})
assert df[df.subplant_id > 1].empty

plant_id_epa=55375 is associated with both plant_id_eia 55375 and 57664. Both of these EIA plants have generators (and CAMD units) CT3 and CT4, which have different subplant_ids for the different EIA plants. It looks like all the units of plant_id_eia 55375 are proposed and have not been reported since 2010. I think this suggests that 57664 is the actual plant_id_eia of the facility and should be associated with plant_id_epa 55375, and that plant_id_eia 55375 should be dropped as part of the crosswalk.
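
To illustrate the plant_id_epa case, here is a rough sketch of how one might list the EPA plants that map to more than one EIA plant, and confirm that dropping the stale EIA plant clears the failing check. The table below is a toy stand-in: the plant IDs and the unit names CT3 and CT4 come from the description above, while the subplant_id values are invented. Dropping plant_id_eia 55375 is the suggestion made above, not a confirmed fix.

```python
import pandas as pd

# Toy stand-in for the crosswalk rows of plant_id_epa=55375.
toy = pd.DataFrame(
    {
        "plant_id_epa": [55375, 55375, 55375, 55375],
        "plant_id_eia": [55375, 55375, 57664, 57664],
        "emissions_unit_id_epa": ["CT3", "CT4", "CT3", "CT4"],
        "subplant_id": [0, 1, 2, 3],  # invented values
    }
)

# EPA plants associated with more than one EIA plant.
eia_per_epa = toy.groupby("plant_id_epa")["plant_id_eia"].nunique()
multi_eia = eia_per_epa[eia_per_epa > 1].index.tolist()

# Apply the suggested fix: drop the rows for the proposed-only EIA plant.
is_stale = (toy["plant_id_epa"] == 55375) & (toy["plant_id_eia"] == 55375)
deduped = toy[~is_stale]

# Re-run the EPA-level check on the remaining rows.
max_subplants = (
    deduped.groupby(["plant_id_epa", "emissions_unit_id_epa"])["subplant_id"]
    .nunique()
    .max()
)
print(multi_eia, max_subplants)
```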

aesharpe commented 1 year ago

  1. plant_id_eia=4042: one camd_unit in this plant maps to two different generators, one of which is missing a unit_id_pudl, so that part of the camd_unit gets a different subplant_id. From what I am seeing for this plant, the whole plant should be one subplant.
  2. plant_id_eia=2708: two camd_units (2A and 2B) are each associated with two generators, one of which they have in common. The generators they do not share have no unit_id_pudl, so these camd_units are associated with different subplants.
  3. plant_id_eia=55126: two camd_units are split between subplants, I think again because some of the generators do not have a unit_id_pudl.

I feel like these could be solved with #2535?