catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

generators_entity_eia contains incorrect prime mover codes #1585

Open grgmiller opened 2 years ago

grgmiller commented 2 years ago

The generators_entity_eia table saved in the pudl sqlite database does not seem to always have correct prime mover codes associated with each generator.

Specific example: In generators_entity_eia, the generator with plant_id_eia == 2465 and generator_id == GT1 is associated with a "CT" prime mover code. However, in the 2020 EIA-923 and EIA-860 files, the prime mover for this generator is listed as "GT".

This is causing errors in analysis.allocate_net_gen because this depends on matching the prime mover codes.

Steps to recreate: I am loading generators_entity_eia from the pudl sqlite file downloaded from https://zenodo.org/record/6349861#.YlsSwtPMJhG I am cross referencing the raw EIA 923 and 860 files from 2020

grgmiller commented 2 years ago

Another example is plant with id 50150, the Union Carbide Seadrift Cogen plant in Texas. In the EIA forms, all of its generators are identified as either GT or ST, while in generators_entity_eia, they are all identified as CT or CA.

This suggests that perhaps the pudl code might be "correcting" the prime mover code if it thinks that it should actually be a combined cycle. If this is the case, it should be updated in all tables so that records can be matched on prime mover code across tables.

grgmiller commented 2 years ago

There is also at least one instance where the prime mover code is missing.

In generators_entity_eia, the PM is identified as "None" for all of the generators at plant 50489. However, in the 2020 EIA-860 file, all of the generators have PMs identified for them.

grgmiller commented 2 years ago

Based on our conversation @cmgosnell and @zaneselvans it seems that gens_eia860 is assigning a prime mover based on the most commonly assigned PM over multiple years. I would propose that a new output pudl table be created (something like gens_eia860_original that reports the prime mover directly as it appears in the published annual file for a single year.

Otherwise, there seems to be no way to programmatically check the assigned prime mover code using the pudl sqlite database. I am currently having to fix this by manually going to the EIA-860 excel files and cross-referencing the code there.

cmgosnell commented 2 years ago

Hey @grgmiller ! After poking at this a little, it seems worthwhile to change the way we categorize prime_mover_code in our "harvesting"/normalization process. I'm going to push a PR that'll explore transitioning prime_mover_code from a "static" (non-changing over time) to an "annual" value. It'll include an updated notebook that is our debugging tool for the harvesting process where you can see all the original inputs.

some lingering questions

We had previously assumed that the PM of a given generator does not change over time. BUT it seems like there are plants that legitimately change (see examples below).

There are 778 plant/generator_id's that have more than one PM across all the years. This is about 2% of all plant/gens. While some of them do seem to be actual conversions and change over time.... many of them seem ambiguous and are probably data entry errors.

There are 67(!) plant/generator_id/report_date's that have more than one PM within one year.

Something else to explore is how this effects the other plant/pm tables that don't have generators (and therefor haven't been getting this "static" treatment). It seems better overall to have data that is inconsistent over time if it is internally consistent between these tables.

some plant explorations

(plant_id_eia, generator_id)

grgmiller commented 1 year ago

Thanks @cmgosnell . My thought is that the best approach is to ensure that the prime_mover_code is cosnsistent between the different tables in a single year, since it is being used as a primary key to match data between the various tables.

I think what this means is that if the pm code changes over time, we should make sure that it consistently changes in all EIA tables. If we think that the change is a typo, we can change it, but we need to make sure that the change is made in all tables.

With regards to typos, is there an easy way to infer that? For example if a plant reports the pm code separately in 923 and 860, and it only changes in one of the tables, then perhaps that is a typo. However, if the pm code is consistently updated across all tables in a year, perhaps that represents an actual conversion. (however, I suppose there is also the possibility that there was a conversion, but the reporter failed to update the code in one of the tables).

zaneselvans commented 1 year ago

I think differentiating between data entry errors where someone fails to update something or uses the wrong code, and actual conversions may have to fall back on manual research. I guess there are patterns of discrepancy that would be more or less likely to represent a real conversion. Like if the code changes only once, and stays changed, and eventually shows up everywhere consistently.

Right now we only attempt to enforce a consistent value in the entity tables, after gathering together all of the observations from the many data tables. The cases where the code is a primary key and has to be left in the data tables instead of being normalized, and propagating the inferred to be correct value back out to the data tables is not something we currently have infrastructure to do (but it would be good to do! And there are statistical inference methods for doing this kind of thing, explored in by Ihab Ilyas and Xu Chu in Data Cleaning and their subsequent research.

grgmiller commented 1 year ago

Here's a specific example:

There is a generator in gens_eia860 with a 2020 report date that is completely missing prime mover data.. This is plant 59012 with generator_id "STEAM" (this is the DC Water CHP plant). It looks like in 2015-2018, this generator had data reported in EIA-860, and was listed as a "CA" prime mover, and the other three generators were listed as "CT".

However, it looks like starting in 2019, this generator was removed from EIA-860, and the other three generators changed their prime mover to "GT".

Should this generator still exist with a 2020 record in gens_eia860? If so, should it be marked as retired instead of just having all missing data associated with it?

zaneselvans commented 1 year ago

It seems like they might have just removed the HRSG, but kept the gas turbine around? Is it missing a bunch of other attributes, or just the PM? If that's the only generator-year missing a PM, then it seems like we should maybe consider it an invalid record and drop it in the transform step. I don't know what fraction of generators that go out of service end up explicitly getting a retirement date. I bet it's not 100%. I wonder if it'll show up revised in future data with a retirement date (is it in the 2021 eia860m data?)

grgmiller commented 1 year ago

It is missing a bunch of other attributes (not just PM) in both the 2019 and 2020 gens_eia860 table. It doesn't show up in the 2021 Early Release data.