Closed cmgosnell closed 1 year ago
The update_subplant_id() and manually_update_subplant_id() functions are relatively straightforward, and we could just put them in analysis/epacamd_eia.py as a band-aid for now. Seeing as they are already part of OGE, it probably makes more sense to work on the underlying problem first.
The core of the update_subplant_id() function is the integration of the boiler_generator_assn unit_ids with the subplant_ids. This is depicted in #1769, and also in the following issue, submitted by Greg to the EPA's repo: https://github.com/USEPA/camd-eia-crosswalk/issues/32.
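As a rough illustration of what integrating the two ids involves, here is a minimal sketch of one diagnostic it implies: finding BGA units whose generators land in more than one subplant. This is not the OGE implementation. The sample records and the helper name find_split_units are hypothetical, and only the column names plant_id_eia, generator_id, unit_id_pudl and subplant_id come from this discussion.

```python
from collections import defaultdict

# Hypothetical records: one row per generator, carrying the crosswalk-derived
# subplant_id and the BGA-derived unit_id_pudl (None where unmapped).
generators = [
    {"plant_id_eia": 1, "generator_id": "CT1", "unit_id_pudl": 1, "subplant_id": 0},
    {"plant_id_eia": 1, "generator_id": "CA1", "unit_id_pudl": 1, "subplant_id": 1},
    {"plant_id_eia": 1, "generator_id": "GT2", "unit_id_pudl": 2, "subplant_id": 2},
    {"plant_id_eia": 1, "generator_id": "GT3", "unit_id_pudl": None, "subplant_id": 2},
]


def find_split_units(rows):
    """Map (plant_id_eia, unit_id_pudl) to its subplant_ids when there is more than one."""
    subplants_by_unit = defaultdict(set)
    for row in rows:
        if row["unit_id_pudl"] is None or row["subplant_id"] is None:
            continue  # nothing to compare for unmapped generators
        key = (row["plant_id_eia"], row["unit_id_pudl"])
        subplants_by_unit[key].add(row["subplant_id"])
    # Keep only the units whose generators disagree on subplant_id.
    return {k: sorted(v) for k, v in subplants_by_unit.items() if len(v) > 1}


split_units = find_split_units(generators)
print(split_units)
```

In this made-up data, unit 1 is split across subplants 0 and 1, which is the kind of disagreement an integrated id would need to resolve.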
The EPA hasn't directly responded to this issue, so it would be worth reaching out to them to ask why. In my deep EPA dive last year, I recall waiting to address this issue due to a difference in the way the crosswalk and BGA tables were designed. Before combining them, I'd want to dig back into this issue. A cursory search suggests that this might not be the case, but checking with the EPA is still a good idea.
Snippet of the 860 Schedule 6 instructions, from which the BGA table is derived:
Determining to what extent this can and should be done will probably take one or two weeks, depending on EPA availability and the difficulty of the issue at hand. We should also consider whether this type of integration is something we want to do in the EPA repo itself, especially if we are already planning to make a PR there: #2371
Is this going to be a multi-node mapping project, or can we just use two nodes? This is something Greg brought up in #1769. My gut instinct is that if we combine the crosswalk and the BGA table we can probably just use two nodes, emissions_unit_id_epa and unit_id_pudl, seeing as generator_id is subsumed within unit_id_pudl. But this warrants further discussion with @TrentonBush and @grgmiller.
This process will probably take another week or so.
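To make the two-node idea concrete, here is a minimal sketch that treats emissions_unit_id_epa and unit_id_pudl as the two node types of a graph and labels each connected component as one subplant. The edge list, the function name assign_subplant_ids and the use of a union-find structure are all assumptions made for illustration. They do not describe how PUDL or OGE actually build the graph.

```python
# Hypothetical edges within a single plant: each pair links an EPA emissions
# unit to a PUDL unit, as a combined crosswalk + BGA table might.
edges = [
    ("1A", 1),
    ("1B", 1),
    ("2", 2),
    ("2", 3),
    ("4", 4),
]


def assign_subplant_ids(edges):
    """Label each connected component of the two-node-type graph with an integer."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for epa_id, pudl_id in edges:
        # Tag each node with its type so an EPA id never collides with a PUDL id.
        a, b = find(("epa", epa_id)), find(("pudl", pudl_id))
        if a != b:
            parent[a] = b

    labels = {}
    result = {}
    for node in list(parent):
        root = find(node)
        labels.setdefault(root, len(labels))  # number components in first-seen order
        result[node] = labels[root]
    return result


subplants = assign_subplant_ids(edges)
for node, subplant_id in subplants.items():
    print(node, subplant_id)
```

With these made-up edges, EPA units 1A and 1B share a subplant through PUDL unit 1, and PUDL units 2 and 3 share one through EPA unit 2, giving three subplants in total. If generator_id turned out not to be fully subsumed within unit_id_pudl, a third node type could be added in the same way.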
There will almost certainly be a phase II of the project where we identify gaps in the logic/BGA/crosswalk that ought to be filled. Hopefully these findings won't compromise the integrity of the project itself. It would be a good idea to anticipate as many of these gaps beforehand as possible. This brainstorming should probably take a few days. The remaining ad hoc spot fixes can then happen on an ongoing or as-needed basis after the rest of the project has been implemented.
Total Time (including lag for communication): About a month or approximately 50 hours
Some initial thoughts & questions:
Does connect_ids care about which year is being reported? Is this consistent w/ how unit_id_pudl is being generated?
One other thing to potentially scope in this process: based on my understanding of what a subplant is supposed to represent, for a combined cycle plant a subplant should ideally contain both CA and CT prime movers. I recently did some work (https://github.com/singularity-energy/open-grid-emissions/pull/297) to try and flag subplants where that isn't the case. If my assumption is correct, wherever a subplant is flagged, it may indicate an incomplete BGA mapping or EPA crosswalk mapping. I'm not sure whether we would want to add some sort of process to link these "stranded" combined cycle parts together as part of the subplant cleaning, or instead only flag these instances and bring them to the attention of EPA and/or EIA for further investigation with the plant owners, so that these mappings could be updated in the EPA crosswalk and/or BGA tables.
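The flagging idea can be sketched as follows. This is not the code from the linked OGE pull request. The sample records and the function name flag_stranded_cc_parts are hypothetical, and the only assumption about the data is that each generator row carries an EIA prime mover code, where CA is the steam part and CT the combustion turbine part of a combined cycle.

```python
from collections import defaultdict

# Hypothetical generator records. prime_mover_code uses the EIA codes
# CA (combined cycle steam part) and CT (combined cycle combustion turbine part).
generators = [
    {"plant_id_eia": 10, "subplant_id": 0, "generator_id": "CT1", "prime_mover_code": "CT"},
    {"plant_id_eia": 10, "subplant_id": 0, "generator_id": "ST1", "prime_mover_code": "CA"},
    {"plant_id_eia": 10, "subplant_id": 1, "generator_id": "CT2", "prime_mover_code": "CT"},
    {"plant_id_eia": 11, "subplant_id": 0, "generator_id": "GT1", "prime_mover_code": "GT"},
]

COMBINED_CYCLE_PARTS = {"CA", "CT"}


def flag_stranded_cc_parts(rows):
    """Return (plant_id_eia, subplant_id) keys that hold only one of CA or CT."""
    movers = defaultdict(set)
    for row in rows:
        movers[(row["plant_id_eia"], row["subplant_id"])].add(row["prime_mover_code"])
    return sorted(
        key
        for key, codes in movers.items()
        # Flag subplants with at least one combined cycle part but not both.
        if codes & COMBINED_CYCLE_PARTS and not COMBINED_CYCLE_PARTS <= codes
    )


flagged = flag_stranded_cc_parts(generators)
print(flagged)
```

Here only subplant 1 of plant 10 is flagged, since it has a CT generator with no matching CA. Whether such cases should be linked automatically or only reported is the open question raised above.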
@grgmiller this is a good point! I know @zaneselvans did some work a while back trying to link units together more completely, and part of that was attempting to group together the combined cycle parts. I believe this is only accessible via pudl_out, by setting unit_ids=True when initializing PudlTabl. This work was specifically aimed at giving every generator a unit_id_pudl, so it currently doesn't touch the bga table, but that is something we could change.
After talking with @arengel we've decided to move forward with Option 1! More detail in issue #2456.
PUDL creates an id called a subplant_id in the analysis/epacamd_eia.py module. In short, this id identifies unique operating entities (combustor-generator combinations) within reported plant_id_eia groupings. This unit is important for disaggregating CEMS data. These subplant_ids represent the smallest unit of aggregation that a CEMS value can accurately map to with the given set of data.
@grgmiller has added onto this subplant id creation process in his Open Grid Emissions repo. Here are the relevant functions:
The goal of this issue is to see how difficult it would be, and what it would entail, to move these subplant_id cleaning steps from OGE into PUDL.
The OGE update_subplant_id() function (referenced in the first link) has the following docstring: