catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
468 stars 107 forks source link

A few questions about changes in database schema changes, particularly `balancing_authority_code_eia` #1882

Closed arengel closed 2 years ago

arengel commented 2 years ago

I've noticed a few changes in the new version of the balancing_authority_code_eia column we get from PudlTabl.plants_eia860, I believe driven by #1749. There are values in that column that don't seem right, at least relative to the version of that column we had previously from PudlTabl.gens_eia860. The following appear to be old codes or coding mistakes given the balancing_authority_name_eia:

  1. CA rather than CISO
  2. IS rather than ISNE
  3. NVE rather than NEVP
  4. PS rather than PSCO
  5. TID rather than TIDC

Also, there are now cases where nerc_region has the value UNK instead of a null value, like pd.NA.

A few questions:

zaneselvans commented 2 years ago

I did a bunch of work cleaning up balancing authority codes in reconciling the FERC-714 and EIA-861, and the "home" table for the balancing_authority_code_eia is really in the EIA-861 data. Which isn't in the database. Which is obviously bad. Now that the codes are appropriately being associated with the plants and not the generators, the values are being harvested from any input data where the balancing authority code is reported in association with a plant ID (rather than a generator ID), and that includes a lot more tables. So I imagine what's happening here is in some years bad codes are prevalent in association with a particular plant, and so end up getting chosen as the most common code.

Since there's only a few dozen blanacing authority codes, maybe the right thing to do is to create a static balancing_authorities_eia table for the moment, and use it to recode the weird balancing authority codes in the source table appropriately before they get to the harvesting step. Though if I recall correctly from the FERC-714/EIA-861 work, there are cases a few cases in which the name/code/ID associations change from year to year as utilities merge or split, which made a mess of things.

Where are you pulling the SQLite files from?

arengel commented 2 years ago

I just grabbed the sqlite from https://data.catalyst.coop/pudl.db today.

I'm a little hazy on the ultimate source of truth on EIA balancing authority codes, or even if there could be such a thing. For convenience I often pull up the list of BA codes you get when you click on 'List of balancing authorities' on this page. It seems like it is consistent with the set in 861 and the most recent 860m.

On the point about creating a static balancing_authorities_eia table and using it to clean up weird codes early makes sense, just need to careful to make sure it is fixing actual errors or conforming legacy codes that represent the same BA, I think looking at BA names and codes will allow a decent chunk of weirdness to be fixed. Because a plant's BA can change over time, would have to be careful about propagating values from one year to fill or fix another. I'm sure there is a safe way to do it but I'm not sure how complicated it would end up being, maybe best to err on the side of leaving some weirdness.

zaneselvans commented 2 years ago

Yeah, I think we want to do these fixes only within a given year (no forward/back filling). It would be a human compiled list of fixes with the purpose of just ensuring that a given BA always uses the same code, and we associate a canonical representation of the name with that code in the little static BA entity table.

I don't know if this is literally true, but I think of the EIA-861 as the closest thing to a source of truth on the BAs since there's a table specifically dedicated to reporting some of their attributes there, including constituent utilities.

arengel commented 2 years ago

While I think this is the right move for PUDL, the Hub does need the BA codes to show up similarly as how they used to. I'm wondering if your old process normalizing BA codes might be a useful starting point for implementing this fix in the Hub, and if so, if you might be able to point me to it.

zaneselvans commented 2 years ago

Which old normalization process do you mean? The way it would have happened when the balancing_authority_code_eia was in the generators_eia860 table, rather than in the plants_eia860 table? Or the normalization that happens in the EIA-861?

The EIA-861 process is in pudl.transform.eia861.py. The useful functions to look at would be:

The results of the EIA-861 process can be obtained with these output methods:

PudlTabl.balancing_authority_eia861()
PudlTabl.balancing_authority_assn_eia861()

The "old" harvesting process as it ran when balancing_authority_code_eia was in the generators_eia860 table rather than the plants_eia860 table would require running the ETL / EIA harvesting process with a different set of parameters for constructing the permanent and annual plant / generator entity tables, such that the reported values would be collected from a different set of raw input tables, and output into the generator table as before.

If we build a static coding table for the balancing authorities and categorize all the various codes by hand (there can't be that many of them) then we'd be able to run that process on all of the input tables before they get to the harvesting step, and the harvested values would definitely correspond to the manual categorization.

Does this sound correct to your @cmgosnell ?

arengel commented 2 years ago

Thanks for those about how this works in PUDL and apologies for not being clear about the Hub's immediate challenge. The immediate issue I am trying to resolve is to get balancing_authority_code_eia as it used to be in PudlTabl.gens_eia860. Originally I didn't look at this as closely as I should have and so somewhat mischaracterized the specific issues. There are three that I see:

  1. Compared to the old method, there are more missing BA codes [in years where they could be reported], and there are some inconsistencies in the codes used for the same BA.
  2. There are also cases where the BA for a plant changes between years.
  3. The part I didn't realize before, is that before 2013, 860 does not have BA codes.

I think I somewhat adequately laid out (1) and (2) above. From my understanding of how these processes work, the static coding table would address (1). I think we both agreed that (2) is a perhaps unfortunate part of the world that PUDL should faithfully report. It's (3) that I was less sure about.

I just chatted with @cmgosnell since she is familiar with relevant parts of the Hub and sounds like she has a plan for addressing (1) and (3) so I think I'm all set for now.

zaneselvans commented 2 years ago

The reason we moved the codes ultimately was that they do change from year to year, albeit rarely. We can definitely get them back into the expected place in the denormalized generators output table, and if we codify them manually most of the inconsistency and missing values will go away, but it turns out that assuming they're static is wrong. We could backfill prior to 2013, but that would also introduce a false consistency.

cmgosnell commented 2 years ago

Back-filling will definitely introduce false consistency, but if we don't develop a methodology for it, explain it and test it everyone who wanted these codes will do their own weird backfilling. My current plan is to test out a few methodologies that we'd employ in the output layer.

zaneselvans commented 2 years ago

Yeah it seems like the right thing to do -- just another thing that should ultimately be implemented in the denormalized outputs / database views one way or another, and labeled so folks know what they're getting into.

IIRC the BAs codes and their utility (but not plant) associations are reported significantly further back than 2013 in the EIA-861 data, so maybe that can help keep things real.