catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Prime Mover from PUDL does not match raw data EIA860 #324

Closed briannacote closed 4 years ago

briannacote commented 5 years ago

Describe the bug From a high-level, we are using the PUDL database to start to automate some processes over here. I ran into a small issue. I figured I would reach out.

I have a spreadsheet attached showing you a few units from Polk where I see the Prime Mover does not match what I see from the EIA 860 data itself. I show you the SQL I used to pull the data as well from PUDL.

PUDL_Export[1].xlsx

I also see this for at least one unit of the following EIA Plant Codes: 2059 54262 57842

To Reproduce I show the output from the PUDL database and compare it to the raw data from the 2017 EIA860 file.

Expected behavior These Prime Movers should match.

Desktop (please complete the following information):

Thanks for the help!

cmgosnell commented 5 years ago

Hey @briannacote! Do you know what years you have in your database that is generating this? I'm going to assume you have all of the years (2011-2017) for now but let me know if that is different.

Because things like prime_mover should be constant over time and should be constant over all of the tables that they are reported in, instead of keeping those values scattered throughout the database, we "harvest" them and store them in one place to have a more normalized database.

Our process pulls the most consistently reported record. We've constrained it so that each harvested field needs to be at least 70% consistent and that generates a good enough match for ~99% of all records we're harvesting. There are a few special cases where we use a different method to capture more correct records, but across the board it seems to work okay.

It looks like the plants you've referenced have a different prime mover reported in 2017. It looks like reporting inconsistencies, but if you have any visibility into whether the less consistent prime movers are indeed correct we could build in some special cases.

plant_id_eia generator_id report_date prime_mover_code occurrences consistency prime_mover_code_consistent
7242 2 2011-01-01 GT 7 5 True
7242 2 2012-01-01 GT 7 5 True
7242 2 2013-01-01 GT 7 5 True
7242 2 2014-01-01 GT 7 5 True
7242 2 2016-01-01 GT 7 5 True
7242 2 2015-01-01 CT 7 2 False
7242 2 2017-01-01 CT 7 2 False

I checked a few of the other plant ID's that you included and this one is actually a good example of this process working very well! This generator shows up 115 times across all of the years and all of the tables and just this one 2017 occurrence is says CT.

plant_id_eia generator_id report_date prime_mover_code occurrences consistency prime_mover_code_consistent
2059 7 2017-01-01 ST 115 114 True
2059 7 2017-01-01 ST 115 114 True
2059 7 2017-01-01 ST 115 114 True
2059 7 2017-01-01 ST 115 114 True
2059 7 2017-01-01 CT 115 1 False
briannacote commented 5 years ago

Oh that is awesome! It makes perfect sense. I have been so focused on 2017 alone I wasn't looking at the forest through the tress. I really appreciate the detailed explanation here to understand what's going on. This was really helpful. It definitely seems these are rogue entries and the wider picture should be used.

Just to note, I do have all years for the EIA 860 data and EIA 923 data in my database.

Thank you!

UdayVaradarajan commented 5 years ago

Hi @cmgosnell - just one addendum. Is it possible that some units may have been repowered - but not reported as a new unit? In that case, one might expect that the prime mover code could change after some date and not revert back - reflecting actual information rather that an error. Do you check for this?

briannacote commented 5 years ago

Hi @UdayVaradarajan, that is also a fair point. I didn't think about it in that way at first. But doing just some quick google-ing came up with articles like this:

https://www.power-technology.com/projects/polk-power-station-expansion-florida/

cmgosnell commented 5 years ago

Yes, this is a good question. In most of the repowering instances that I've seen, the repowered units get new generator IDs. We've done some checking for this while setting up the harvesting process, but most of that was checking whether or not the primary fuel changed with any of these inconsistencies.

But this definitely looks like the case here.

There are two types of harvested variables: static and annual. It's possible that the prime_mover_code is one of those variables that is nearly always static. While it might be truly correct to have some other mid-way between static and annual variable, it is actually very easy to move the value from static to annual if that is a more accurate arrangement. Do either of you know of any other repowered plants that didn't change primary fuel types so I can polk around ;-) around to see other patterns of reporting?

zaneselvans commented 5 years ago

This seems like a hard choice (short of implementing the option for special cases) -- is it better to have a higher frequency of errors that are someone else's (reporting utilities) or a lower frequency of errors that are ours?

UdayVaradarajan commented 5 years ago

I don't think this needs to be an either/or. You could likely just add in a further check to your algorithm for generating the single table to flag if the change is secular (i.e. if the flip is persistent) and keep the newest designation. You can then add a "change table" with these flagged changes, recording the previous prime_mover and year of change in the change table.

cmgosnell commented 5 years ago

I was thinking about this type of arrangement for these mostly stable values that tend to change and don't change back. iirc, we assumed this type of flip is such an edge case because most of the repowered generators we checked did just get a new id. But I expect that doing something like this would be more accurate. We'd need to make some assumptions about what persistent flips look like, but that would be doable. If we developed such a process, it could almost certainly be generalizable so we could use it on any of our "static" columns in the entity tables. I think there are a few values we put in the annual bucket because of a similar type of change - the easiest example is the utility address.

cmgosnell commented 4 years ago

subsumed within Issue #446