PUDL creates an id called a subplant_id in the analysis/epacamd_eia.py module. In short, this id identifies unique operating entities (combustor-generator combinations) within reported plant_id_eia groupings. This unit is important for disaggregating CEMS data. These subplant_ids represent the smallest unit of aggregation that a CEMS value can accurately map to with the given set of data.

@grgmiller has built on this subplant id creation process in his Open Grid Emissions (OGE) repo. Here are the relevant functions:

For some subplant id steps that could potentially be integrated into PUDL, take a look at the update_subplant_ids and manually_update_subplant_ids functions here: https://github.com/singularity-energy/open-grid-emissions/blob/1477ac5cff002cd218bfcc768655b121a83426ac/src/data_cleaning.py#L175
The identify_subplants function (https://github.com/singularity-energy/open-grid-emissions/blob/1477ac5cff002cd218bfcc768655b121a83426ac/src/data_cleaning.py#L49) will walk you through our entire process, including how we use the existing crosswalk.

The goal of this issue is to see how difficult it would be / what it would entail to move these subplant_id cleaning steps from OGE into PUDL.

The OGE update_subplant_ids() function (referenced in the first link) has the following docstring:

    Data Preparation
        Because the existing subplant_id crosswalk was only meant to map CAMD units to EIA generators, it
        is missing a large number of subplant_ids for generators that do not report to CEMS. Before applying this
        function to the subplant crosswalk, the crosswalk must be completed with all generators by outer
        merging in the complete list of generators from EIA-860 (specifically the gens_eia860 table from pudl).
        This dataframe also contains the complete list of `unit_id_pudl` mappings that will be necessary.

    High-level overview of method:
        1. Use the PUDL subplant_id if available. In the case where a unit_id_pudl groups several subplants,
        we overwrite these multiple existing subplant_id with a single subplant_id.
        2. Where there is no PUDL subplant_id, we use the unit_id_pudl to assign a unique subplant_id
        3. Where there is neither a pudl subplant_id nor unit_id_pudl, we use the generator ID to
        assign a unique subplant_id
        4. All of the new unique ids are renumbered in consecutive ascending order

    Detailed explanation of steps:
        1. Because the current subplant_id code does not take boiler-generator associations into account,
        there may be instances where the code assigns generators to different subplants when in fact, according
        to the boiler-generator association table, these generators are grouped into a single unit based on their
        boiler associations. The first step of this function is thus to identify if multiple subplant_id have
        been assigned to a single unit_id_pudl. If so, we replace the existing subplant_ids with a single subplant_id.
        For example, if a generator A was assigned subplant_id 0 and generator B was assigned subplant_id 1, but
        both generators A and B are part of unit_id_pudl 1, we would re-assign the subplant_id to both generators to
        0 (we always use the lowest number subplant_id in each unit_id_pudl group). This may result in some subplant_id
        being skipped, but this is okay because we will later renumber all subplant ids (i.e. if there were also a
        generator C with subplant_id 2, there would be no subplant_id 1 at the plant).
        Likewise, sometimes multiple unit_id_pudl are connected to a single subplant_id, so we also correct the
        unit_id_pudl based on these connections.
        2. The second issue is that there are many NA subplant_id that we should fill. To do this, we first look at
        unit_id_pudl. If a group of generators are assigned a unit_id_pudl but have NA subplant_ids, we assign a single
        new subplant_id to this group of generators. If there are still generators at a plant that have both NA subplant_id
        and NA unit_id_pudl, we for now assume that each of these generators constitutes its own subplant. We thus assign a unique
        subplant_id to each generator that is unique from any existing subplant_id already at the plant.
        In the case that there are multiple emissions_unit_id_epa at a plant that are not matched to any other identifiers (generator_id,
        unit_id_pudl, or subplant_id), as is the case when there are units that report to CEMS but which do not exist in the EIA
        data, we assign these units to a single subplant.
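
The four high-level steps in the docstring above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the OGE implementation: the function name `update_subplant_ids_sketch`, the row layout (one dict per generator, with `None` standing in for NA), and the tuple keys are all hypothetical. It also leaves out the unit_id_pudl correction and the handling of unmatched emissions_unit_id_epa values.

```python
from collections import defaultdict


def update_subplant_ids_sketch(rows):
    """Fill and renumber subplant_id per plant (hypothetical helper).

    rows: list of dicts with keys plant_id_eia, generator_id,
    unit_id_pudl and subplant_id, where None stands in for NA.
    """
    by_plant = defaultdict(list)
    for row in rows:
        by_plant[row["plant_id_eia"]].append(dict(row))  # copy, don't mutate input

    result = []
    for gens in by_plant.values():
        # Step 1: find the lowest existing subplant_id within each unit_id_pudl.
        lowest = {}
        for g in gens:
            unit, sub = g["unit_id_pudl"], g["subplant_id"]
            if unit is not None and sub is not None:
                lowest[unit] = min(sub, lowest.get(unit, sub))

        # Give every generator a provisional key; the first tuple element
        # keeps existing ids, unit-based ids and generator-based ids apart.
        for g in gens:
            unit, sub = g["unit_id_pudl"], g["subplant_id"]
            if unit is not None and unit in lowest:
                g["_key"] = (0, lowest[unit])        # step 1: collapse to lowest id
            elif sub is not None:
                g["_key"] = (0, sub)                 # keep the existing id
            elif unit is not None:
                g["_key"] = (1, unit)                # step 2: one id per unit
            else:
                g["_key"] = (2, g["generator_id"])   # step 3: one id per generator

        # Step 4: renumber in consecutive ascending order within the plant.
        new_ids = {key: i for i, key in enumerate(sorted({g["_key"] for g in gens}))}
        for g in gens:
            g["subplant_id"] = new_ids[g.pop("_key")]
            result.append(g)
    return result


example = [
    {"plant_id_eia": 1, "generator_id": "A", "unit_id_pudl": 1, "subplant_id": 0},
    {"plant_id_eia": 1, "generator_id": "B", "unit_id_pudl": 1, "subplant_id": 1},
    {"plant_id_eia": 1, "generator_id": "C", "unit_id_pudl": None, "subplant_id": 2},
    {"plant_id_eia": 1, "generator_id": "D", "unit_id_pudl": 2, "subplant_id": None},
    {"plant_id_eia": 1, "generator_id": "E", "unit_id_pudl": 2, "subplant_id": None},
    {"plant_id_eia": 1, "generator_id": "F", "unit_id_pudl": None, "subplant_id": None},
]
new_ids_by_gen = {
    g["generator_id"]: g["subplant_id"] for g in update_subplant_ids_sketch(example)
}
# A and B collapse to one subplant, C is renumbered from 2 to 1,
# D and E share a new unit-based id, and F gets its own id.
```

In this toy plant, generators A and B mirror the docstring's example: they start with subplant_ids 0 and 1, share unit_id_pudl 1, and end up in the same subplant, after which the gap left at id 1 is closed by the renumbering.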

Scope

1. Quick and Dirty

The update_subplant_ids() and manually_update_subplant_ids() functions are relatively straightforward, and we could just put them in analysis/epacamd_eia.py as a band-aid for now. Seeing as they are already part of OGE, though, it probably makes more sense to work on the underlying problem first.

2. Deep Clean

2.1 Connect BGA and Crosswalk

The core of the update_subplant_ids() function is the integration of the boiler_generator_assn unit_ids with the subplant_ids. This is depicted in #1769 and in the following issue, submitted by Greg to the EPA's repo: https://github.com/USEPA/camd-eia-crosswalk/issues/32.

The EPA hasn't directly responded to this issue, so it would be worth reaching out to them to ask why. In my deep EPA dive last year, I recall waiting to address this issue due to a difference in the way the crosswalk and BGA tables were designed. Before combining them, I'd want to dig back into this issue. A cursory search shows that this might not be the case, but checking with EPA is a good idea.

[Image: snippet of the 860 Schedule 6 instructions, from which the BGA table is derived]

Determining to what extent this can and should be done will probably take one or two weeks, depending on EPA availability and the difficulty of the issue at hand. We should also consider whether this type of integration is something we want to do in the EPA repo itself, especially if we are already planning to make a PR there: #2371

2.2 See how compatible the old NetworkX node-mapping method is with a hybrid BGA-crosswalk table

Is this going to be a multi-node mapping project, or can we just use two nodes? This is something Greg brought up in #1769. My gut instinct is that if we combine the crosswalk and the BGA table, we can probably just use two nodes, emissions_unit_id_epa and unit_id_pudl, seeing as generator_id is subsumed within unit_id_pudl. But this warrants further discussion with @TrentonBush and @grgmiller.

This process will probably take another week or so.
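
To make the two-node idea concrete, here is a minimal sketch of grouping emissions_unit_id_epa and unit_id_pudl values into subplants by finding connected components. It uses a small union-find in plain Python as a dependency-free stand-in; in the real code NetworkX's `connected_components` would presumably play the same role. The function name `connected_groups`, the `("epa", ...)` / `("pudl", ...)` node tags, and the example edges are all hypothetical, and this assumes a hybrid BGA-crosswalk table that supplies one edge per matched pair.

```python
def connected_groups(edges):
    """Group nodes linked by edges into connected components (hypothetical helper).

    edges: iterable of (node_a, node_b) pairs. Nodes are tagged tuples such as
    ("epa", "CT1") or ("pudl", 1) so the two id types can never collide.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# One plant: two CEMS units feed unit_id_pudl 1, a third feeds unit_id_pudl 2.
edges = [
    (("epa", "CT1"), ("pudl", 1)),
    (("epa", "CT2"), ("pudl", 1)),
    (("epa", "B3"), ("pudl", 2)),
]
subplants = connected_groups(edges)
```

Each resulting set would correspond to one subplant at the plant. Whether two node types are really enough depends on the outcome of the discussion above.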

2.3 Account for spot-fixes

There will almost certainly be a phase II of the project where we identify gaps in the logic/BGA/crosswalk that ought to be filled. Hopefully these findings won't compromise the integrity of the project itself. It would be a good idea to anticipate as many of these gaps beforehand as possible. This brainstorming should probably take a few days. The ad-hoc spot fixes themselves can then happen on an ongoing or as-needed basis after the rest of the project has been implemented.

Total Time (including lag for communication): About a month or approximately 50 hours

Some initial thoughts & questions:

While I would love it if EPA integrated some of these fixes, it doesn't seem like they have been super responsive, so I'd personally lean much more towards option 1. Especially w/ EPA's at least slight resistance to anything multi-year, I just don't see the elapsed time for 2 being realistic.
If we assume EPA doesn't want to integrate something like this, does option 1 feel like a bandaid or is it the solution? Another way to ask this is: are there aspects of option 2 that we'd need to
Why doesn't connect_ids care about which year is being reported? Is this consistent w/ how unit_id_pudl is being generated?

One other thing to potentially scope in this process: based on my understanding of what a subplant is supposed to represent, for a combined cycle plant, ideally a subplant should contain both CA and CT prime movers. I recently did some work (https://github.com/singularity-energy/open-grid-emissions/pull/297) to try to flag subplants where that isn't the case. If my assumption is correct, wherever a subplant is flagged, it may indicate that there is an incomplete BGA mapping or EPA crosswalk mapping. I'm not sure whether we would want to add some sort of process to link these "stranded" combined cycle parts together as part of the subplant cleaning, or instead only flag these instances and bring them to the attention of EPA and/or EIA for further investigation with the plant owners, so that these mappings could be updated in the EPA crosswalk and/or BGA tables.
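
As a rough illustration of the flagging idea described above (not the code from the linked OGE pull request), a check for "stranded" combined cycle parts could look like the sketch below. The function name `flag_stranded_cc_subplants` and the row layout are hypothetical, and the rule is an assumption: a subplant is flagged when it contains a CA or a CT prime mover but not both.

```python
from collections import defaultdict


def flag_stranded_cc_subplants(gens):
    """Return (plant_id_eia, subplant_id) pairs with CA or CT but not both.

    gens: list of dicts with keys plant_id_eia, subplant_id and
    prime_mover_code (assumed column names).
    """
    movers = defaultdict(set)
    for g in gens:
        movers[(g["plant_id_eia"], g["subplant_id"])].add(g["prime_mover_code"])
    # Flag when exactly one of the two combined cycle parts is present.
    return sorted(
        key for key, codes in movers.items() if ("CA" in codes) != ("CT" in codes)
    )


gens = [
    {"plant_id_eia": 1, "subplant_id": 0, "prime_mover_code": "CT"},
    {"plant_id_eia": 1, "subplant_id": 0, "prime_mover_code": "CA"},
    {"plant_id_eia": 1, "subplant_id": 1, "prime_mover_code": "CT"},
    {"plant_id_eia": 2, "subplant_id": 0, "prime_mover_code": "ST"},
    {"plant_id_eia": 2, "subplant_id": 1, "prime_mover_code": "CA"},
]
flagged = flag_stranded_cc_subplants(gens)
# flagged == [(1, 1), (2, 1)]
```

A flag here only marks a candidate for review. It does not say whether the fix belongs in the BGA table, the EPA crosswalk, or a manual override.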

@grgmiller this is a good point! I know @zaneselvans did some work a while back to link units together more completely, and part of that was attempting to group together the combined cycle parts. I believe this is only accessible via pudl_out when unit_ids=True is set when initializing PudlTabl. This work was specifically attempting to give all generators a unit_id_pudl, so it currently doesn't touch the BGA table, but that is something we could change.

After talking with @arengel we've decided to move forward with Option 1! More detail in issue #2456.

catalyst-cooperative / pudl

scope sub-plant id cleaning integration #2400