Open grgmiller opened 1 year ago
So I've noticed that currently the `subplant_id` grouping is only based on matching `CAMD_UNIT_ID` (now called `emissions_unit_id_epa` in pudl) with `EIA_GENERATOR_ID` (`generator_id` in pudl), but the subplant identification does not take into account any boiler-generator associations or `unit_id_pudl` mappings.
For example, for `plant_id_eia == 1391`, the `epa_crosswalk` code identifies generators 1A, 2A, and 3A as three separate subplants, even though the boiler-generator association identifies these three generators as part of one single `unit_id_pudl`, due to their many-to-many (m:m) boiler-generator relationships. This subplant mapping should take the boiler-generator associations into account as well.
Thanks for flagging this. Now that I'm the one working on crosswalk stuff, I can address this. I'll check in with Trenton about why he removed boilers in the first place.
I think this might have something to do with the discrepancies between the crosswalk and the `bga` table.
This is from the crosswalk:
| | plant_id_epa | emissions_unit_id_epa | generator_id_epa | plant_id_eia | boiler_id | generator_id |
|---|---|---|---|---|---|---|
| 1068 | 1391 | 1A | 1A | 1391 | 1A | 1A |
| 1069 | 1391 | 2A | 2A | 1391 | 2A | 2A |
| 1070 | 1391 | 3A | 3A | 1391 | 3A | 3A |
| 1071 | 1391 | 4A | 4A | 1391 | 4A | |
| 1072 | 1391 | 5A | 5A | 1391 | 5A | |
And this is from the bga table from 2018 (theoretically what the crosswalk was based on):
| | plant_id_eia | report_date | generator_id | boiler_id | unit_id_eia | unit_id_pudl | boiler_generator_assn_type_code | steam_plant_type_code | bga_source | data_maturity |
|---|---|---|---|---|---|---|---|---|---|---|
47432 | 1391 | 2018-01-01 00:00:00 | 1A | 2A | 1 | 1 | eia860_org | final | ||
47433 | 1391 | 2018-01-01 00:00:00 | 1A | 1A | 1 | 1 | eia860_org | final | ||
47434 | 1391 | 2018-01-01 00:00:00 | 1A | 9 | 1 | 1 | eia860_org | final | ||
47435 | 1391 | 2018-01-01 00:00:00 | 1A | 3A | 1 | 1 | eia860_org | final | ||
47436 | 1391 | 2018-01-01 00:00:00 | 2A | 2A | 1 | 1 | eia860_org | final | ||
47437 | 1391 | 2018-01-01 00:00:00 | 2A | 5A | 1 | 1 | eia860_org | final | ||
47438 | 1391 | 2018-01-01 00:00:00 | 2A | 4A | 1 | 1 | eia860_org | final | ||
47439 | 1391 | 2018-01-01 00:00:00 | 2A | 3A | 1 | 1 | eia860_org | final | ||
47440 | 1391 | 2018-01-01 00:00:00 | 2A | 1A | 1 | 1 | eia860_org | final | ||
47441 | 1391 | 2018-01-01 00:00:00 | 3A | 1A | 1 | 1 | eia860_org | final | ||
47442 | 1391 | 2018-01-01 00:00:00 | 3A | 3A | 1 | 1 | eia860_org | final | ||
47443 | 1391 | 2018-01-01 00:00:00 | 3A | 5A | 1 | 1 | eia860_org | final | ||
47444 | 1391 | 2018-01-01 00:00:00 | 3A | 2A | 1 | 1 | eia860_org | final | ||
47445 | 1391 | 2018-01-01 00:00:00 | 3A | 4A | 1 | 1 | eia860_org | final |
Thanks. One thing that I've noticed is that the raw power sector data crosswalk does not include the complete mapping of boilers to generators that is in the EIA-860 boiler-generator association table, so I think to fix this issue, we would need to merge the `bga` table into the `epacamd_eia` crosswalk table and use those associations instead.
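A minimal sketch of what that merge could look like with pandas. The two small frames below are toy stand-ins, not the real `epacamd_eia` or `bga` tables; the boiler pairings are abbreviated and the value `1` for `unit_id_pudl` is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-in for the crosswalk (illustrative rows only).
crosswalk = pd.DataFrame({
    "plant_id_eia": [1391, 1391, 1391],
    "emissions_unit_id_epa": ["1A", "2A", "3A"],
    "generator_id": ["1A", "2A", "3A"],
})

# Toy stand-in for the BGA table: each generator is tied to several boilers,
# and all of them share one unit_id_pudl (assumed to be 1 here).
bga = pd.DataFrame({
    "plant_id_eia": [1391] * 6,
    "generator_id": ["1A", "1A", "2A", "2A", "3A", "3A"],
    "boiler_id": ["1A", "2A", "1A", "2A", "1A", "2A"],
    "unit_id_pudl": [1] * 6,
})

# Collapse the BGA table to one row per generator before merging, so the
# m:m boiler-generator rows do not duplicate crosswalk rows.
gen_units = bga[["plant_id_eia", "generator_id", "unit_id_pudl"]].drop_duplicates()

merged = crosswalk.merge(
    gen_units,
    how="left",
    on=["plant_id_eia", "generator_id"],
    validate="m:1",  # raises if a generator maps to more than one unit_id_pudl
)
```

With these toy inputs, every crosswalk row picks up the same `unit_id_pudl`, which is the information the subplant grouping is currently missing.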
My understanding is that the current network analysis uses `networkx.from_pandas_edgelist()` to identify the edges between the `emissions_unit_id_epa` and `generator_id` for each plant. Skimming the documentation, it looks like this function takes a single source column and a single target column, so it may only be able to link two sets of IDs at a time, not three or more. Ideally, I think we'd want to identify the edges between `emissions_unit_id_epa`, `generator_id`, and `unit_id_pudl`. However, this may have to be done in two stages if you can't identify edges between more than two IDs at a time.
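One way the two-stage idea could work, sketched under assumptions: build one graph per pair of ID columns, then combine them with `networkx.compose()` and take connected components. The helper names (`prefixed`, `edge_graph`), the node-name prefixes, and the toy rows are all hypothetical and are not taken from the `epa_crosswalk` module.

```python
import networkx as nx
import pandas as pd

# Toy rows for illustration only; unit_id_pudl = 1 is an assumed value.
crosswalk = pd.DataFrame({
    "plant_id_eia": [1391, 1391, 1391],
    "emissions_unit_id_epa": ["1A", "2A", "3A"],
    "generator_id": ["1A", "2A", "3A"],
})
bga = pd.DataFrame({
    "plant_id_eia": [1391, 1391, 1391],
    "generator_id": ["1A", "2A", "3A"],
    "unit_id_pudl": [1, 1, 1],
})


def prefixed(df, col, prefix):
    """Build node names that stay unique across plants and ID types."""
    return prefix + "_" + df["plant_id_eia"].astype(str) + "_" + df[col].astype(str)


def edge_graph(df, source_col, source_prefix, target_col, target_prefix):
    """Build a graph from one pair of ID columns (hypothetical helper)."""
    edges = pd.DataFrame({
        "source": prefixed(df, source_col, source_prefix),
        "target": prefixed(df, target_col, target_prefix),
    })
    return nx.from_pandas_edgelist(edges, source="source", target="target")


# Stage 1: emissions unit <-> generator edges from the crosswalk.
stage_one = edge_graph(crosswalk, "emissions_unit_id_epa", "unit", "generator_id", "gen")
# Stage 2: generator <-> unit_id_pudl edges from the BGA table.
stage_two = edge_graph(bga, "generator_id", "gen", "unit_id_pudl", "pudl")

# Shared generator nodes tie the two graphs together.
combined = nx.compose(stage_one, stage_two)

n_before = nx.number_connected_components(stage_one)  # 3 with this toy data
n_after = nx.number_connected_components(combined)    # 1 with this toy data
```

Prefixing the node names keeps emissions unit `1A` and generator `1A` from collapsing into a single node, which would otherwise happen because the two ID types share the same strings.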
That sounds right. Maybe @TrentonBush can shed some light on this? I'll ask.
See https://github.com/USEPA/camd-eia-crosswalk/issues/32 for more information on getting the boiler-generator association issue fixed within the crosswalk itself.
Otherwise we probably need a post-processing step to merge BGA into the CAMD crosswalk to ensure its completeness.
Hi @TrentonBush, I wanted to start a conversation about some potential improvements to the `epa_crosswalk` module. We have been using this extensively for our hourly egrid project to group data by `subplant_id`.
However, I've noticed that 1) not all CEMS plants/units have a subplant ID assigned and 2) this does not create any subplant groupings for plants that only exist in EIA-860/923 but do not report to CEMS.
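As a rough illustration of how the first gap could be measured, a left merge with `indicator=True` lists the CEMS units that have no subplant assigned. Both frames and all of their values are made up for this sketch; the frame names `cems_units` and `subplant_map` are hypothetical.

```python
import pandas as pd

# Hypothetical CEMS units and a hypothetical subplant mapping.
cems_units = pd.DataFrame({
    "plant_id_eia": [1, 1, 1],
    "emissions_unit_id_epa": ["1A", "2A", "6A"],
})
subplant_map = pd.DataFrame({
    "plant_id_eia": [1, 1],
    "emissions_unit_id_epa": ["1A", "2A"],
    "subplant_id": [1, 2],
})

keys = ["plant_id_eia", "emissions_unit_id_epa"]
merged = cems_units.merge(subplant_map, how="left", on=keys, indicator=True)

# Rows marked "left_only" are CEMS units with no subplant_id assigned.
missing = merged.loc[merged["_merge"] == "left_only", keys]
```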
Improving coverage for CEMS units
My understanding is that if a CAMD plant/unit does not appear in the EPA-EIA crosswalk, it will not have a subplant ID assigned. I'm not sure if there is much we can do about this except improve the coverage of the crosswalk itself. It also looks like some data is currently getting dropped because of leading zeros that appear in the `CAMD_UNIT_ID` or `EIA_GENERATOR_ID` columns but do not appear in the `epacems` data (I think @aesharpe is working on fixing this as part of https://github.com/catalyst-cooperative/pudl/pull/1692).
However, in `filter_out_unmatched`, the code currently removes any CAMD units where the `MATCH_TYPE_GEN` includes "CAMD Unmatched", which means that the unit has not been matched to an EIA generator ID. This makes sense, since currently the subplant is partially assigned based on the edges between `CAMD_UNIT_ID` and `EIA_GENERATOR_ID`. However, the crosswalk also includes a column for `CAMD_GENERATOR_ID`, which I'm thinking could potentially be used instead of the `EIA_GENERATOR_ID` for these "CAMD Unmatched" units. I think this would only be useful for those CAMD plants that do not appear in the EIA data as well.
Creating `subplant_id`s for plants that don't report to CEMS
It would be useful to have `subplant_id`s for plants/generators that report to EIA but not to CAMD/CEMS. It seems like the `unit_id_pudl` column that is created in the `boiler_generator_assn_eia860` table is perhaps the equivalent of a `subplant_id` (i.e., a grouping of boilers and generators that cannot be separated), so maybe those `unit_id_pudl` values could be integrated into the `epa_crosswalk` process to create a `subplant_id` for these plants? I'm thinking that you might not want to use the `unit_id_pudl` column directly as a subplant ID in case a plant has some generators that report to CEMS and some that don't, in which case you might end up with a `subplant_id` and a `unit_id_pudl` that use the same integer value to represent different parts of the plant.
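To make the collision concrete, here is a small sketch with made-up data. Offsetting `unit_id_pudl` by the plant's largest existing `subplant_id` is just one possible way to keep the two ID spaces apart; it is an assumption for illustration, not something the `epa_crosswalk` module does.

```python
import pandas as pd

# Hypothetical plant: G1 and G2 report to CEMS and already have a subplant_id;
# G3 and G4 only report to EIA and only have a unit_id_pudl.
gens = pd.DataFrame({
    "plant_id_eia": [10, 10, 10, 10],
    "generator_id": ["G1", "G2", "G3", "G4"],
    "subplant_id": [1.0, 1.0, None, None],
    "unit_id_pudl": [2.0, 2.0, 1.0, 1.0],
})

# Naive fill: G3 and G4 get the value 1, which already means "G1 + G2".
naive = gens["subplant_id"].fillna(gens["unit_id_pudl"])

# Offset fill: shift unit_id_pudl past the largest subplant_id in the plant.
plant_max = gens.groupby("plant_id_eia")["subplant_id"].transform("max").fillna(0)
gens["subplant_id_combined"] = (
    gens["subplant_id"].fillna(plant_max + gens["unit_id_pudl"]).astype("int64")
)
```

In this toy case the naive fill labels all four generators `1`, while the offset version keeps G1/G2 and G3/G4 in separate groups.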