Closed arengel closed 2 years ago
I did a bunch of work cleaning up balancing authority codes in reconciling the FERC-714 and EIA-861, and the "home" table for the balancing_authority_code_eia
is really in the EIA-861 data. Which isn't in the database. Which is obviously bad. Now that the codes are appropriately being associated with the plants and not the generators, the values are being harvested from any input data where the balancing authority code is reported in association with a plant ID (rather than a generator ID), and that includes a lot more tables. So I imagine what's happening here is in some years bad codes are prevalent in association with a particular plant, and so end up getting chosen as the most common code.
Since there's only a few dozen blanacing authority codes, maybe the right thing to do is to create a static balancing_authorities_eia
table for the moment, and use it to recode the weird balancing authority codes in the source table appropriately before they get to the harvesting step. Though if I recall correctly from the FERC-714/EIA-861 work, there are cases a few cases in which the name/code/ID associations change from year to year as utilities merge or split, which made a mess of things.
Where are you pulling the SQLite files from?
I just grabbed the sqlite from https://data.catalyst.coop/pudl.db today.
I'm a little hazy on the ultimate source of truth on EIA balancing authority codes, or even if there could be such a thing. For convenience I often pull up the list of BA codes you get when you click on 'List of balancing authorities' on this page. It seems like it is consistent with the set in 861 and the most recent 860m.
On the point about creating a static balancing_authorities_eia
table and using it to clean up weird codes early makes sense, just need to careful to make sure it is fixing actual errors or conforming legacy codes that represent the same BA, I think looking at BA names and codes will allow a decent chunk of weirdness to be fixed. Because a plant's BA can change over time, would have to be careful about propagating values from one year to fill or fix another. I'm sure there is a safe way to do it but I'm not sure how complicated it would end up being, maybe best to err on the side of leaving some weirdness.
Yeah, I think we want to do these fixes only within a given year (no forward/back filling). It would be a human compiled list of fixes with the purpose of just ensuring that a given BA always uses the same code, and we associate a canonical representation of the name with that code in the little static BA entity table.
I don't know if this is literally true, but I think of the EIA-861 as the closest thing to a source of truth on the BAs since there's a table specifically dedicated to reporting some of their attributes there, including constituent utilities.
While I think this is the right move for PUDL, the Hub does need the BA codes to show up similarly as how they used to. I'm wondering if your old process normalizing BA codes might be a useful starting point for implementing this fix in the Hub, and if so, if you might be able to point me to it.
Which old normalization process do you mean? The way it would have happened when the balancing_authority_code_eia
was in the generators_eia860
table, rather than in the plants_eia860
table? Or the normalization that happens in the EIA-861?
The EIA-861 process is in pudl.transform.eia861.py
. The useful functions to look at would be:
balancing_authority()
balancing_authority_assn()
normalize_balancing_authority()
_ba_code_backfill()
The results of the EIA-861 process can be obtained with these output methods:
PudlTabl.balancing_authority_eia861()
PudlTabl.balancing_authority_assn_eia861()
The "old" harvesting process as it ran when balancing_authority_code_eia
was in the generators_eia860
table rather than the plants_eia860
table would require running the ETL / EIA harvesting process with a different set of parameters for constructing the permanent and annual plant / generator entity tables, such that the reported values would be collected from a different set of raw input tables, and output into the generator table as before.
If we build a static coding table for the balancing authorities and categorize all the various codes by hand (there can't be that many of them) then we'd be able to run that process on all of the input tables before they get to the harvesting step, and the harvested values would definitely correspond to the manual categorization.
Does this sound correct to your @cmgosnell ?
Thanks for those about how this works in PUDL and apologies for not being clear about the Hub's immediate challenge. The immediate issue I am trying to resolve is to get balancing_authority_code_eia
as it used to be in PudlTabl.gens_eia860
. Originally I didn't look at this as closely as I should have and so somewhat mischaracterized the specific issues. There are three that I see:
I think I somewhat adequately laid out (1) and (2) above. From my understanding of how these processes work, the static coding table would address (1). I think we both agreed that (2) is a perhaps unfortunate part of the world that PUDL should faithfully report. It's (3) that I was less sure about.
I just chatted with @cmgosnell since she is familiar with relevant parts of the Hub and sounds like she has a plan for addressing (1) and (3) so I think I'm all set for now.
The reason we moved the codes ultimately was that they do change from year to year, albeit rarely. We can definitely get them back into the expected place in the denormalized generators output table, and if we codify them manually most of the inconsistency and missing values will go away, but it turns out that assuming they're static is wrong. We could backfill prior to 2013, but that would also introduce a false consistency.
Back-filling will definitely introduce false consistency, but if we don't develop a methodology for it, explain it and test it everyone who wanted these codes will do their own weird backfilling. My current plan is to test out a few methodologies that we'd employ in the output layer.
Yeah it seems like the right thing to do -- just another thing that should ultimately be implemented in the denormalized outputs / database views one way or another, and labeled so folks know what they're getting into.
IIRC the BAs codes and their utility (but not plant) associations are reported significantly further back than 2013 in the EIA-861 data, so maybe that can help keep things real.
I've noticed a few changes in the new version of the
balancing_authority_code_eia
column we get fromPudlTabl.plants_eia860
, I believe driven by #1749. There are values in that column that don't seem right, at least relative to the version of that column we had previously fromPudlTabl.gens_eia860
. The following appear to be old codes or coding mistakes given thebalancing_authority_name_eia
:CA
rather thanCISO
IS
rather thanISNE
NVE
rather thanNEVP
PS
rather thanPSCO
TID
rather thanTIDC
Also, there are now cases where
nerc_region
has the valueUNK
instead of a null value, likepd.NA
.A few questions:
balancing_authority_code_eia
is missing but could be filled based on thebalancing_authority_name_eia
, do see filling those values as something that makes sense to include in PUDL?