catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Improve coverage of `subplant_id` identification in `analysis.epa_crosswalk` #1769

Open grgmiller opened 1 year ago

grgmiller commented 1 year ago

Hi @TrentonBush, I wanted to start a conversation about some potential improvements to the epa_crosswalk module. We have been using it extensively in our hourly egrid project to group data by subplant_id.

However, I've noticed that 1) not all CEMS plants/units have a subplant ID assigned, and 2) the module does not create any subplant groupings for plants that exist only in EIA-860/923 and do not report to CEMS.

Improving coverage for CEMS units

My understanding is that if a CAMD plant/unit does not appear in the EPA-EIA crosswalk, it will not have a subplant ID assigned. I'm not sure there is much we can do about this except improve the coverage of the crosswalk itself. It also looks like some data is currently getting dropped because of leading zeros that appear in the CAMD_UNIT_ID or EIA_GENERATOR_ID columns of the crosswalk but not in the epacems data (I think @aesharpe is working on fixing this as part of https://github.com/catalyst-cooperative/pudl/pull/1692).
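To illustrate the leading-zero mismatch, here is a minimal sketch of one way to normalize IDs before comparing them. The `normalize_id` helper and the sample IDs are hypothetical, and this is not necessarily how the linked PR handles it; stripping zeros from every ID is an assumption that would need checking against real plant data.

```python
def normalize_id(raw_id: str) -> str:
    """Strip leading zeros so that '001' and '1' compare equal (hypothetical helper)."""
    cleaned = raw_id.strip()
    stripped = cleaned.lstrip("0")
    # An all-zero ID such as "00" collapses to "0" rather than to an empty string.
    return stripped if stripped or not cleaned else "0"


# Made-up IDs: the crosswalk side carries leading zeros, the CEMS side does not.
crosswalk_ids = ["001", "0GT2", "1A"]
cems_ids = ["1", "GT2", "1A"]

cems_normalized = {normalize_id(unit_id) for unit_id in cems_ids}
matched = [uid for uid in crosswalk_ids if normalize_id(uid) in cems_normalized]
print(matched)
```

Without the normalization step, only "1A" would match and the other two units would be dropped.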

However, in filter_out_unmatched, the code currently removes any CAMD units whose MATCH_TYPE_GEN includes "CAMD Unmatched", meaning the unit has not been matched to an EIA generator ID. This makes sense, since the subplant is currently assigned partly based on the edges between CAMD_UNIT_ID and EIA_GENERATOR_ID. However, the crosswalk also includes a CAMD_GENERATOR_ID column, which I'm thinking could be used in place of EIA_GENERATOR_ID for these "CAMD Unmatched" units. I think this would only be useful for CAMD plants that do not appear in the EIA data at all.
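A rough sketch of that fallback, using pandas on a made-up crosswalk fragment. The column names come from the crosswalk as described above; the sample rows, the "matched (placeholder)" label, and the `generator_id_for_graph` column are assumptions for illustration only, not the module's actual behavior.

```python
import pandas as pd

# Made-up crosswalk rows; only "CAMD Unmatched" is a match type taken from the thread.
crosswalk = pd.DataFrame(
    {
        "CAMD_PLANT_ID": [10, 10, 20],
        "CAMD_UNIT_ID": ["1", "2", "GT1"],
        "CAMD_GENERATOR_ID": ["G1", "G2", "GT1"],
        "EIA_GENERATOR_ID": ["G1", None, None],
        "MATCH_TYPE_GEN": ["matched (placeholder)", "CAMD Unmatched", "CAMD Unmatched"],
    }
)

unmatched = crosswalk["MATCH_TYPE_GEN"].str.contains("CAMD Unmatched", na=False)

# Keep the EIA generator ID where a match exists; otherwise fall back to the
# CAMD generator ID instead of dropping the row.
crosswalk["generator_id_for_graph"] = crosswalk["EIA_GENERATOR_ID"].where(
    ~unmatched, crosswalk["CAMD_GENERATOR_ID"]
)
print(crosswalk[["CAMD_UNIT_ID", "generator_id_for_graph"]])
```

The unit-to-generator edges could then be built from `generator_id_for_graph`, so unmatched units still end up in some subplant.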

Creating subplant_ids for plants that don't report to CEMS

It would be useful to have subplant_ids for plants/generators that report to EIA but not to CAMD/CEMS. It seems like the unit_id_pudl column created in the boiler_generator_assn_eia860 table may be the equivalent of a subplant_id (i.e., a grouping of boilers and generators that cannot be separated), so maybe those unit_id_pudl values could be integrated into the epa_crosswalk process to create a subplant_id for these plants? I'm thinking that you might not want to use unit_id_pudl directly as a subplant ID, because a plant may have some generators that report to CEMS and some that don't, in which case you might end up with a subplant_id and a unit_id_pudl that use the same integer value to represent different parts of the plant.
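One possible way around that collision, sketched with pandas on made-up data: tag each ID with its source, then renumber the tagged keys within each plant. The `key` and `subplant_id_filled` names are hypothetical. Note that this renumbers existing subplant IDs in order of appearance, which may or may not be acceptable.

```python
import pandas as pd

# Made-up plant: generators A and B have crosswalk subplant IDs,
# C and D only have a unit_id_pudl, and plant 2 does not report to CEMS at all.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1, 2],
        "generator_id": ["A", "B", "C", "D", "X"],
        "subplant_id": [0, 1, None, None, None],
        "unit_id_pudl": [None, None, 1, 1, 1],
    }
)

has_subplant = gens["subplant_id"].notna()

# Prefix each ID with its source so that subplant 1 and unit_id_pudl 1 stay distinct.
cems_key = "cems_" + gens["subplant_id"].astype("Int64").astype(str)
pudl_key = "pudl_" + gens["unit_id_pudl"].astype("Int64").astype(str)
gens["key"] = cems_key.where(has_subplant, pudl_key)

# Number the distinct keys 0, 1, 2, ... within each plant.
unique_keys = gens[["plant_id_eia", "key"]].drop_duplicates()
unique_keys["subplant_id_filled"] = unique_keys.groupby("plant_id_eia").cumcount()
gens = gens.merge(unique_keys, on=["plant_id_eia", "key"], how="left")
print(gens[["plant_id_eia", "generator_id", "subplant_id_filled"]])
```

Here generators C and D share a new ID that does not clash with the crosswalk-derived IDs of A and B.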

grgmiller commented 1 year ago

So I've noticed that currently the subplant_id grouping is only based on matching CAMD_UNIT_ID (now called emissions_unit_id_epa in pudl) with EIA_GENERATOR_ID (generator_id in pudl), but the subplant identification does not take into account any boiler-generator associations or unit_id_pudl mappings.

For example, for plant_id_eia == 1391, the epa_crosswalk code identifies generators 1A, 2A, and 3A as three separate subplants, even though the boiler-generator association assigns all three generators to a single unit_id_pudl because of their many-to-many boiler-generator relationships. The subplant mapping should take the boiler-generator associations into account as well.
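To make the grouping concrete, here is a small pure-Python sketch that treats the 2018 boiler-generator pairs for plant 1391 (as quoted in this thread) as graph edges and finds the connected groups. The `connected_groups` helper is hypothetical and is not the code used in epa_crosswalk.

```python
from collections import defaultdict

# (generator_id, boiler_id) pairs for plant_id_eia 1391, from the 2018 bga table.
bga_pairs = [
    ("1A", "2A"), ("1A", "1A"), ("1A", "9"), ("1A", "3A"),
    ("2A", "2A"), ("2A", "5A"), ("2A", "4A"), ("2A", "3A"), ("2A", "1A"),
    ("3A", "1A"), ("3A", "3A"), ("3A", "5A"), ("3A", "2A"), ("3A", "4A"),
]

# Tag nodes by type so generator "1A" and boiler "1A" remain separate nodes.
adjacency = defaultdict(set)
for gen, boiler in bga_pairs:
    gen_node, boiler_node = ("gen", gen), ("boiler", boiler)
    adjacency[gen_node].add(boiler_node)
    adjacency[boiler_node].add(gen_node)


def connected_groups(adjacency):
    """Return the connected components of an undirected graph (hypothetical helper)."""
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups


groups = connected_groups(adjacency)
generator_groups = [sorted(n for kind, n in grp if kind == "gen") for grp in groups]
print(generator_groups)
```

With these edges, all three generators fall into one group, which is the grouping the subplant mapping would ideally reproduce.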

aesharpe commented 1 year ago

Thanks for flagging this. Now that I'm the one working on crosswalk stuff, I can address this. I'll check in with Trenton about why he removed boilers in the first place.

aesharpe commented 1 year ago

I think this might have something to do with the discrepancies between the crosswalk and the bga table.

This is from the crosswalk:

plant_id_epa emissions_unit_id_epa generator_id_epa plant_id_eia boiler_id generator_id
1068 1391 1A 1A 1391 1A 1A
1069 1391 2A 2A 1391 2A 2A
1070 1391 3A 3A 1391 3A 3A
1071 1391 4A 4A 1391 4A
1072 1391 5A 5A 1391 5A

And this is from the bga table from 2018 (theoretically what the crosswalk was based on):

plant_id_eia report_date generator_id boiler_id unit_id_eia unit_id_pudl boiler_generator_assn_type_code steam_plant_type_code bga_source data_maturity
47432 1391 2018-01-01 00:00:00 1A 2A 1 1 eia860_org final
47433 1391 2018-01-01 00:00:00 1A 1A 1 1 eia860_org final
47434 1391 2018-01-01 00:00:00 1A 9 1 1 eia860_org final
47435 1391 2018-01-01 00:00:00 1A 3A 1 1 eia860_org final
47436 1391 2018-01-01 00:00:00 2A 2A 1 1 eia860_org final
47437 1391 2018-01-01 00:00:00 2A 5A 1 1 eia860_org final
47438 1391 2018-01-01 00:00:00 2A 4A 1 1 eia860_org final
47439 1391 2018-01-01 00:00:00 2A 3A 1 1 eia860_org final
47440 1391 2018-01-01 00:00:00 2A 1A 1 1 eia860_org final
47441 1391 2018-01-01 00:00:00 3A 1A 1 1 eia860_org final
47442 1391 2018-01-01 00:00:00 3A 3A 1 1 eia860_org final
47443 1391 2018-01-01 00:00:00 3A 5A 1 1 eia860_org final
47444 1391 2018-01-01 00:00:00 3A 2A 1 1 eia860_org final
47445 1391 2018-01-01 00:00:00 3A 4A 1 1 eia860_org final

grgmiller commented 1 year ago

Thanks. One thing that I've noticed is that the raw power sector data crosswalk does not include the complete mapping of boilers to generators that is in the EIA-860 boiler-generator association table, so I think to fix this issue, we would need to merge the bga table into the epacamd_eia crosswalk table and use those associations instead.

My understanding is that the current network analysis uses networkx.from_pandas_edgelist() to identify the edges between the emissions_unit_id_epa and generator_id for each plant. Skimming the documentation, it looks like this function only builds edges between two columns of node IDs (a source and a target) at a time, not three or more. Ideally, I think we'd want to identify the edges between emissions_unit_id_epa, generator_id, and unit_id_pudl. However, this may have to be done in two stages if you can't identify edges between more than two ID columns at a time.
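A minimal sketch of that two-stage approach: build one graph per pair of columns, then combine them with networkx.compose() before taking connected components. The rows are illustrative (loosely based on plant 1391, with an assumed unit_id_pudl of 1), and the node prefixes are an assumption added so that IDs of different types cannot collide.

```python
import networkx as nx
import pandas as pd

# Illustrative rows only; unit_id_pudl == 1 is assumed for the example.
ids = pd.DataFrame(
    {
        "emissions_unit_id_epa": ["1A", "2A", "3A"],
        "generator_id": ["1A", "2A", "3A"],
        "unit_id_pudl": [1, 1, 1],
    }
)

# Prefix each ID with its type so unit "1A" and generator "1A" are distinct nodes.
nodes = pd.DataFrame(
    {
        "unit": "epa_unit_" + ids["emissions_unit_id_epa"],
        "gen": "gen_" + ids["generator_id"],
        "pudl": "pudl_unit_" + ids["unit_id_pudl"].astype(str),
    }
)

# Stage 1: emissions unit <-> generator edges.
unit_gen_graph = nx.from_pandas_edgelist(nodes, source="unit", target="gen")
# Stage 2: generator <-> unit_id_pudl edges.
gen_pudl_graph = nx.from_pandas_edgelist(nodes, source="gen", target="pudl")

# Union of both edge sets; shared generator nodes tie the two stages together.
combined = nx.compose(unit_gen_graph, gen_pudl_graph)
components = list(nx.connected_components(combined))
print(nx.number_connected_components(unit_gen_graph), len(components))
```

In this toy case the first stage alone yields three components, while the combined graph yields one.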

aesharpe commented 1 year ago

That sounds right. Maybe @TrentonBush can shed some light on this? I'll ask.

grgmiller commented 1 year ago

See https://github.com/USEPA/camd-eia-crosswalk/issues/32 for more information on getting the boiler-generator association issue fixed within the crosswalk itself.

Otherwise, we probably need a post-processing step that merges the BGA table into the CAMD crosswalk to ensure its completeness.
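Such a post-processing step might look roughly like the following pandas sketch, which attaches unit_id_pudl from the BGA table to each crosswalk row. The frames are illustrative stand-ins (the unit_id_pudl values are assumed), not the real PUDL tables, and a real implementation would also need to handle report years and generators missing from either side.

```python
import pandas as pd

# Illustrative stand-ins for the crosswalk and the BGA table.
crosswalk = pd.DataFrame(
    {
        "plant_id_eia": [1391, 1391, 1391],
        "emissions_unit_id_epa": ["1A", "2A", "3A"],
        "generator_id": ["1A", "2A", "3A"],
    }
)
bga = pd.DataFrame(
    {
        "plant_id_eia": [1391, 1391, 1391, 1391],
        "generator_id": ["1A", "1A", "2A", "3A"],
        "boiler_id": ["1A", "2A", "2A", "3A"],
        "unit_id_pudl": [1, 1, 1, 1],
    }
)

# Collapse the BGA table to one row per generator before merging.
bga_units = bga[["plant_id_eia", "generator_id", "unit_id_pudl"]].drop_duplicates()

# validate= raises if a generator maps to more than one unit_id_pudl.
augmented = crosswalk.merge(
    bga_units,
    on=["plant_id_eia", "generator_id"],
    how="left",
    validate="many_to_one",
)
print(augmented)
```

The resulting unit_id_pudl column could then serve as an extra set of edges when grouping units into subplants.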