Improve coverage of `subplant_id` identification in `analysis.epa_crosswalk`

grgmiller commented 1 year ago

Hi @TrentonBush, I wanted to start a conversation about some potential improvements to the epa_crosswalk module. We have been using it extensively for our hourly egrid project to group data by subplant_id.

However, I've noticed that 1) not all CEMS plants/units have a subplant ID assigned, and 2) the module does not create any subplant groupings for plants that exist only in EIA-860/923 and do not report to CEMS.

Improving coverage for CEMS units

My understanding is that if a CAMD plant/unit does not appear in the EPA-EIA crosswalk, it will not have a subplant ID assigned. I'm not sure there is much we can do about this except improve the coverage of the crosswalk itself. It also looks like some data is currently getting dropped because of leading zeros that appear in the CAMD_UNIT_ID or EIA_GENERATOR_ID columns of the crosswalk but not in the epacems data (I think @aesharpe is working on fixing this as part of https://github.com/catalyst-cooperative/pudl/pull/1692).
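As a rough illustration of the leading-zero mismatch (a sketch only, not the fix from the PR linked above), the ids could be normalized on both sides before matching. The helper name `strip_leading_zeros` and the sample values are hypothetical.

```python
import pandas as pd


def strip_leading_zeros(ids: pd.Series) -> pd.Series:
    """Remove leading zeros from string ids, keeping "0" for all-zero ids."""
    # Missing values are passed through untouched.
    return ids.map(lambda s: s.lstrip("0") or "0", na_action="ignore")


# Hypothetical sample values, not taken from the real crosswalk.
crosswalk = pd.DataFrame({"CAMD_UNIT_ID": ["01", "002", "1A", "0", None]})
crosswalk["CAMD_UNIT_ID"] = strip_leading_zeros(crosswalk["CAMD_UNIT_ID"])
print(crosswalk["CAMD_UNIT_ID"].tolist())
```

The same normalization would need to be applied to the ids in the epacems data, otherwise the two sides still would not line up.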

However, in filter_out_unmatched, the code currently removes any CAMD unit whose MATCH_TYPE_GEN includes "CAMD Unmatched", meaning the unit has not been matched to an EIA generator ID. This makes sense, since the subplant is currently assigned in part from the edges between CAMD_UNIT_ID and EIA_GENERATOR_ID. However, the crosswalk also includes a CAMD_GENERATOR_ID column, which I'm thinking could be used instead of EIA_GENERATOR_ID for these "CAMD Unmatched" units. I think this would only be useful for CAMD plants that do not appear in the EIA data at all.
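A minimal sketch of that fallback, assuming the crosswalk columns named above. The rows and the simplified MATCH_TYPE_GEN values are made up for illustration, and `generator_id_for_edges` is a hypothetical column name, not something that exists in pudl.

```python
import pandas as pd

# Hypothetical rows; MATCH_TYPE_GEN values are simplified for illustration.
crosswalk = pd.DataFrame(
    {
        "CAMD_UNIT_ID": ["1", "2", "3"],
        "CAMD_GENERATOR_ID": ["G1", "G2", None],
        "EIA_GENERATOR_ID": ["GEN1", None, None],
        "MATCH_TYPE_GEN": ["Matched", "CAMD Unmatched", "CAMD Unmatched"],
    }
)

unmatched = crosswalk["MATCH_TYPE_GEN"].str.contains("CAMD Unmatched", na=False)

# Keep the EIA generator id for matched rows; fall back to the CAMD
# generator id for "CAMD Unmatched" rows instead of dropping them.
crosswalk["generator_id_for_edges"] = crosswalk["EIA_GENERATOR_ID"].where(
    ~unmatched, crosswalk["CAMD_GENERATOR_ID"]
)

# Only rows with no usable generator id at all would still be filtered out.
usable = crosswalk[crosswalk["generator_id_for_edges"].notna()]
print(usable[["CAMD_UNIT_ID", "generator_id_for_edges"]])
```

One caveat: CAMD and EIA generator ids come from different sources, so the fallback ids may need a prefix or flag to avoid accidental collisions within a plant.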

Creating `subplant_id`s for plants that don't report to CEMS

It would be useful to have subplant_ids for plants/generators that report to EIA but not to CAMD/CEMS. The unit_id_pudl column created in the boiler_generator_assn_eia860 table looks like it may be the equivalent of a subplant_id (i.e. a grouping of boilers and generators that cannot be separated), so maybe those unit_id_pudl values could be integrated into the epa_crosswalk process to create a subplant_id for these plants? I'm thinking that you might not want to use the unit_id_pudl column directly as a subplant id, because a plant may have some generators that report to CEMS and some that don't. In that case you could end up with a subplant_id and a unit_id_pudl that use the same integer value to represent different parts of the plant.
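One way to avoid that collision, sketched under assumptions: tag each id with its source, then renumber within each plant. The function name, the `subplant_key` and `subplant_id_combined` columns, and the sample plant are all hypothetical.

```python
import pandas as pd


def combine_subplant_ids(gens: pd.DataFrame) -> pd.DataFrame:
    """Build one per-plant id from crosswalk subplant ids and unit_id_pudl."""
    out = gens.copy()
    # Tag each id with its source so crosswalk id 1 and unit_id_pudl 1 differ.
    cems_key = "cems_" + out["subplant_id"].astype("string")
    unit_key = "unit_" + out["unit_id_pudl"].astype("string")
    # Prefer the crosswalk-based id; fall back to unit_id_pudl where missing.
    out["subplant_key"] = cems_key.fillna(unit_key)
    keys = out[["plant_id_eia", "subplant_key"]].dropna().drop_duplicates()
    # Number the distinct keys 0, 1, 2, ... within each plant.
    keys = keys.assign(
        subplant_id_combined=keys.groupby("plant_id_eia").cumcount()
    )
    return out.merge(keys, on=["plant_id_eia", "subplant_key"], how="left")


# Hypothetical plant: A and B report to CEMS, C and D only to EIA.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1],
        "generator_id": ["A", "B", "C", "D"],
        "subplant_id": pd.array([0, 1, None, None], dtype="Int64"),
        "unit_id_pudl": pd.array([None, None, 1, 1], dtype="Int64"),
    }
)
combined = combine_subplant_ids(gens)
print(combined[["generator_id", "subplant_key", "subplant_id_combined"]])
```

Here generator B (crosswalk subplant 1) and generators C and D (unit_id_pudl 1) end up with different combined ids, even though the original integers are the same.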

grgmiller commented 1 year ago

So I've noticed that currently the subplant_id grouping is only based on matching CAMD_UNIT_ID (now called emissions_unit_id_epa in pudl) with EIA_GENERATOR_ID (generator_id in pudl), but the subplant identification does not take into account any boiler-generator associations or unit_id_pudl mappings.

For example, for plant_id_eia == 1391, the epa_crosswalk code identifies generators 1A, 2A, and 3A as three separate subplants, even though the boiler-generator association assigns all three generators to a single unit_id_pudl because of their many-to-many boiler-generator relationships. The subplant mapping should take the boiler-generator associations into account as well.

aesharpe commented 1 year ago

Thanks for flagging this. Now that I'm the one working on crosswalk stuff, I can address this. I'll check in with Trenton about why he removed boilers in the first place.

aesharpe commented 1 year ago

I think this might have something to do with the discrepancies between the crosswalk and the bga table.

This is from the crosswalk:

	plant_id_epa	emissions_unit_id_epa	generator_id_epa	plant_id_eia	boiler_id	generator_id
1068	1391	1A	1A	1391	1A	1A
1069	1391	2A	2A	1391	2A	2A
1070	1391	3A	3A	1391	3A	3A
1071	1391	4A	4A	1391		4A
1072	1391	5A	5A	1391		5A

And this is from the bga table from 2018 (theoretically what the crosswalk was based on):

	plant_id_eia	report_date	generator_id	boiler_id	unit_id_pudl	steam_plant_type_code	bga_source	data_maturity
47432	1391	2018-01-01 00:00:00	1A	2A	1	1	eia860_org	final
47433	1391	2018-01-01 00:00:00	1A	1A	1	1	eia860_org	final
47434	1391	2018-01-01 00:00:00	1A	9	1	1	eia860_org	final
47435	1391	2018-01-01 00:00:00	1A	3A	1	1	eia860_org	final
47436	1391	2018-01-01 00:00:00	2A	2A	1	1	eia860_org	final
47437	1391	2018-01-01 00:00:00	2A	5A	1	1	eia860_org	final
47438	1391	2018-01-01 00:00:00	2A	4A	1	1	eia860_org	final
47439	1391	2018-01-01 00:00:00	2A	3A	1	1	eia860_org	final
47440	1391	2018-01-01 00:00:00	2A	1A	1	1	eia860_org	final
47441	1391	2018-01-01 00:00:00	3A	1A	1	1	eia860_org	final
47442	1391	2018-01-01 00:00:00	3A	3A	1	1	eia860_org	final
47443	1391	2018-01-01 00:00:00	3A	5A	1	1	eia860_org	final
47444	1391	2018-01-01 00:00:00	3A	2A	1	1	eia860_org	final
47445	1391	2018-01-01 00:00:00	3A	4A	1	1	eia860_org	final

grgmiller commented 1 year ago

Thanks. One thing that I've noticed is that the raw power sector data crosswalk does not include the complete mapping of boilers to generators that is in the EIA-860 boiler-generator association table, so I think to fix this issue, we would need to merge the bga table into the epacamd_eia crosswalk table and use those associations instead.

My understanding is that the current network analysis uses networkx.from_pandas_edgelist() to identify the edges between the emissions_unit_id_epa and generator_id for each plant. Skimming the documentation, it looks like this function only takes two id columns at a time (a source and a target), not three or more. Ideally, I think we'd want to identify the edges between emissions_unit_id_epa, generator_id, and unit_id_pudl. However, this may have to be done in two stages if you can't identify edges between more than two ids at a time.
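Since from_pandas_edgelist() takes a single source column and a single target column, one option is to stack the unit-generator edges and the generator-unit_id_pudl edges into one two-column edge list before building the graph. The sketch below does this for the plant 1391 rows shown earlier in the thread. It is not the existing epa_crosswalk code: the function name and the node prefixes are assumptions, and it handles one plant at a time.

```python
import networkx as nx
import pandas as pd


def assign_subplant_ids(crosswalk: pd.DataFrame, bga: pd.DataFrame) -> pd.DataFrame:
    """Group generators of ONE plant into subplants via connected components."""
    # Prefix node labels so unit "1A" and generator "1A" are distinct nodes.
    unit_gen = pd.DataFrame(
        {
            "source": "unit_" + crosswalk["emissions_unit_id_epa"],
            "target": "gen_" + crosswalk["generator_id"],
        }
    )
    gen_pudl_unit = pd.DataFrame(
        {
            "source": "gen_" + bga["generator_id"],
            "target": "pudl_unit_" + bga["unit_id_pudl"].astype(str),
        }
    )
    edges = pd.concat([unit_gen, gen_pudl_unit], ignore_index=True).drop_duplicates()
    graph = nx.from_pandas_edgelist(edges, source="source", target="target")
    rows = []
    for subplant_id, nodes in enumerate(nx.connected_components(graph)):
        for node in nodes:
            if node.startswith("gen_"):
                rows.append(
                    {"generator_id": node[len("gen_"):], "subplant_id": subplant_id}
                )
    return pd.DataFrame(rows).sort_values("generator_id", ignore_index=True)


# Plant 1391: unit/generator pairs from the crosswalk excerpt above.
crosswalk = pd.DataFrame(
    {
        "emissions_unit_id_epa": ["1A", "2A", "3A", "4A", "5A"],
        "generator_id": ["1A", "2A", "3A", "4A", "5A"],
    }
)
# Distinct generator/unit_id_pudl pairs from the 2018 bga excerpt above.
bga = pd.DataFrame({"generator_id": ["1A", "2A", "3A"], "unit_id_pudl": [1, 1, 1]})

result = assign_subplant_ids(crosswalk, bga)
print(result)
```

With these inputs, generators 1A, 2A, and 3A land in one subplant because they all connect to the same unit_id_pudl node, while 4A and 5A each stay in their own subplant.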

aesharpe commented 1 year ago

That sounds right. Maybe @TrentonBush can shed some light on this? I'll ask.

grgmiller commented 1 year ago

See https://github.com/USEPA/camd-eia-crosswalk/issues/32 for more information on getting the boiler-generator association issue fixed within the crosswalk itself.

Otherwise, we probably need a post-processing step to merge the BGA into the CAMD crosswalk to ensure its completeness.
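Such a post-processing step might look roughly like the following, using the plant 1391 rows from earlier in the thread. This is a sketch under assumptions: the function name is hypothetical, it uses a single report year of the bga table, and it prefers the bga boiler ids over the ones in the crosswalk.

```python
import pandas as pd


def merge_bga_into_crosswalk(crosswalk: pd.DataFrame, bga: pd.DataFrame) -> pd.DataFrame:
    """Expand crosswalk rows with every boiler the bga links to each generator."""
    keys = ["plant_id_eia", "generator_id"]
    bga_boilers = bga[keys + ["boiler_id"]].drop_duplicates()
    merged = crosswalk.merge(
        bga_boilers, on=keys, how="left", suffixes=("_crosswalk", "_bga")
    )
    # Prefer the bga boiler id; keep the crosswalk value where the bga has none.
    merged["boiler_id"] = merged["boiler_id_bga"].fillna(merged["boiler_id_crosswalk"])
    return merged.drop(columns=["boiler_id_crosswalk", "boiler_id_bga"])


crosswalk = pd.DataFrame(
    {
        "plant_id_eia": [1391] * 5,
        "emissions_unit_id_epa": ["1A", "2A", "3A", "4A", "5A"],
        "generator_id": ["1A", "2A", "3A", "4A", "5A"],
        "boiler_id": ["1A", "2A", "3A", None, None],
    }
)
# Generator -> boilers, as listed in the 2018 bga excerpt for plant 1391.
bga_pairs = {
    "1A": ["2A", "1A", "9", "3A"],
    "2A": ["2A", "5A", "4A", "3A", "1A"],
    "3A": ["1A", "3A", "5A", "2A", "4A"],
}
bga = pd.DataFrame(
    [
        {"plant_id_eia": 1391, "generator_id": gen, "boiler_id": boiler}
        for gen, boilers in bga_pairs.items()
        for boiler in boilers
    ]
)

merged = merge_bga_into_crosswalk(crosswalk, bga)
print(merged)
```

Generators 4A and 5A have no bga rows in the excerpt, so they keep a single row each with a missing boiler_id.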
