catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

generators_entity_eia operating_date field blank when op dates disagree #423

Closed gschivley closed 5 days ago

gschivley commented 4 years ago

Describe the bug

I created a PUDL sqlite database with EIA860 and 923 data from 2011-2017. At least one plant (55177) has a NULL operating_date value in the for all generators in generators_entity_eia table. All years show that generators came online in 2001 but the files split (3 to 3) about if it was in March or June.

Bug Severity

How badly is this bug affecting you?

I use operating_date to determine a retirement year for power plants. Null values mean that plants don't have a retirement date, so I need to go back and add them to my settings file by hand.

To Reproduce

Build the datapackages with 860 data from 2011-2017.

Expected behavior

I'm not quite sure what the expected behavior should be. In this specific case the year of operation is consistent across time despite the changing month. Maybe use the most recent value instead of NULL?

zaneselvans commented 4 years ago

Right now, this is the expected behavior -- when any of the supposedly constant fields in 860 is too inconsistent, it gets marked NA/Null, since we want to have a single value associated with these long-lived entities (plants, generators, etc.) but it's not clear what that value should be, in the general case. Is there any reason to think that the most recent value is more likely to be correct than the older values?

In this particular case, it seems like as long as the dates are close to each other they should be considered consistent (like we do with lat/lon, where they're rounded to the nearest 0.01 degrees or something). This would happen in the harvesting functions that @cmgosnell set up.

gschivley commented 4 years ago

I agree that it’s difficult to generalize rules for these disagreements. The fact that it changes and is then consistent makes me trust the new values a bit more.

I’ll take a look to see which other plants have values that change over time.

grgmiller commented 3 months ago

I just wanted to re-flag this issue. I'm noticing missing operating dates for a specific plant (plant 136, generator 2). In the 2022 EIA-860 data, this has an operating date of December 1984, but as far as I can tell this date is missing from pudl. It looks like back in 2005 this generator had an operating date of January 1985, and in more recent versions of 860, it has a reported operating date of December 1984.

Regardless of the ability to create a single static generator_operating_date, it would be nice if there was a way to access any data that exists in the original raw EIA spreadsheets through PUDL, rather than that data being deleted. (Maybe this does exist in one of the output tables and I just haven't been able to find it yet).

I feel like this is a pretty big issue because this means that I have to go back to relying on directly reading raw excel files downloaded from the EIA website to get data I need in these cases.

zaneselvans commented 3 months ago

@grgmiller what version of the data are you working with? I think we may have fixed this in #3340 / #3419.

We don't write the "raw" versions of the spreadsheets into the database because there's no real schema / consistent datatypes for them at that point. They just exist as dataframes.

We could potentially write the pre-harvesting _core_eia* assets to the DB which would help with this kind of debugging without anybody needing to go back to the spreadsheets, but we don't really want people depending on those tables since they contain a ton of duplicate and internally inconsistent data.

grgmiller commented 3 months ago

Locally, I'm working with v2023.12.01 (which I know is outdated), but I also checked on datasette, which I assume is the most recent version of the data. Looking at the two tables mentioned in https://github.com/catalyst-cooperative/pudl/issues/3340, it looks like both the generator_operating_date and current_planned_generator_operating_date columns for plant 136 generator 2 are blank for all records. I also looked in core_eia__entities_generators and core_eia860__scd_generators and couldn't find a non-missing operating date for this generator. Is there a different table that I should be looking in?

I guess maybe the raw/pre-harvest assets may not be appropriate to write to the database, but I feel like it could make sense if:

  1. If there is a "entity" table that is meant to represent static attributes, and that static attribute is not consistent over time, maybe it's appropriate to treat that as an NA (although maybe an alternative approach could be to use the most recently provided or most frequently provided value)
  2. If there is a table that has a "report_date" column, then any column in that table should change over time and represent what was reported in the raw data for that report date, even if not consistent over time.

^ Maybe this is already the design intent, and the issue with plant 136 generator 2 is just a bug rather than an intentional choice to drop the operating date.

grgmiller commented 3 months ago

I'm also noticing that this seems to be affecting several generators with operating dates that are in the early 1900s - not sure if this is also a result of inconsistent operating dates being reported, or if it has to do with the age of the generator, but for example there is no operating date data for plant 62 generators 5 and 6 (built 1915); plant 63 generators 2 and 3 (built 1914), and plant 294 generator 2 (built 1918)

zaneselvans commented 3 months ago

I think it's mostly already set up the way you're describing.

When multiple conflicting values of a given attribute are reported, there's a process for selecting which one gets kept, if any. The default process is "Pick the most frequently reported value, as long as it accounts for more than 70% of the non-null values that were observed for this attribute", and if no value meets this criteria, it is set to NA.

For some fields, we have other methods of deciding which value to keep, the simplest being setting a lower minimum consistency (less than 70%).

We can't straightforwardly report the raw value for any value in a table with a report_date column, because within a single year the same attribute is often reported in many different tables, with potentially conflicting values.

In the case of generator_operating_date that really is a static value -- the generator went into service on some past date and it shouldn't ever change. So that field comes from the core_eia__entity_generators table. Any value of that field which you see in any other core_ or out_ table is ultimately derived from the value you find in core_eia__entity_generators which is the single source of truth for that value after we've normalized the data. So if it's NA in the entity table, it'll be NA everywhere else downstream.

And if it's NA in the entity table, that means that the operating date was reported inconsistently, so we don't know what it really was. We could switch to taking the first reported value, or the last reported value, or the average of all observed values, or we could give the values found in one spreadsheet tab that we consider to be the authoritative home for the attribute a higher weight, etc. But we don't have a clean abstraction for choosing among a menu of different selection processes for different attributes right now.

I'll take a look at the _core_eia860__generators asset that I have cached locally and see what's going on with the value for that plant specifically.

zaneselvans commented 3 months ago

As expected, there are 2 operating dates getting reported, with roughly equal frequency. But it seems like #3419 should have addressed this case. @e-belfer do you know why this case would have slipped through that fix?

report_date plant_id_eia generator_id generator_operating_date
14384 2001-01-01 00:00:00 136 2 1985-01-01 00:00:00
31220 2002-01-01 00:00:00 136 2 1985-01-01 00:00:00
48252 2003-01-01 00:00:00 136 2 1985-01-01 00:00:00
65521 2004-01-01 00:00:00 136 2 1985-01-01 00:00:00
83213 2005-01-01 00:00:00 136 2 1985-01-01 00:00:00
101258 2006-01-01 00:00:00 136 2 1985-01-01 00:00:00
119481 2007-01-01 00:00:00 136 2 1985-01-01 00:00:00
138555 2008-01-01 00:00:00 136 2 1985-01-01 00:00:00
13569 2009-01-01 00:00:00 136 2 1985-01-01 00:00:00
18174 2010-01-01 00:00:00 136 2 1985-01-01 00:00:00
67756 2012-01-01 00:00:00 136 2 1985-01-01 00:00:00
86481 2013-01-01 00:00:00 136 2 1985-01-01 00:00:00
105634 2014-01-01 00:00:00 136 2 1985-01-01 00:00:00
112856 2015-01-01 00:00:00 136 2 1984-12-01 00:00:00
132914 2016-01-01 00:00:00 136 2 1984-12-01 00:00:00
153637 2017-01-01 00:00:00 136 2 1984-12-01 00:00:00
175075 2018-01-01 00:00:00 136 2 1984-12-01 00:00:00
197204 2019-01-01 00:00:00 136 2 1984-12-01 00:00:00
219934 2020-01-01 00:00:00 136 2 1984-12-01 00:00:00
243350 2021-01-01 00:00:00 136 2 1984-12-01 00:00:00
267995 2022-01-01 00:00:00 136 2 1984-12-01 00:00:00
293379 2023-01-01 00:00:00 136 2 1984-12-01 00:00:00
grgmiller commented 3 months ago

Thanks for the explanation - that makes sense. My guess about why https://github.com/catalyst-cooperative/pudl/pull/3419 didn't work is that even though these two dates are within a month of each other, they are in different years. To resolve this, you might need to look within a 12-month window, rather than within the same year.

zaneselvans commented 3 months ago

Ah okay, I didn't realize it was based on calendar year and not a rolling window. Maybe this is an easy improvement to make.

e-belfer commented 3 months ago

That's exactly what's happening. When implementing, I considered this and didn't wind up choosing the window option, partially because it's slightly more challenging to implement but primarily because it raises the question of how to collapse multiple dates within twelve month windows of one another.

An example (with faked data) to hopefully clarify:

plant_id_eia generator_id report_date generator_operating_date
1336 S2 2010-01-01 1973-04-01
1336 S2 2011-01-01 1973-06-01
1336 S2 2012-01-01 1974-05-01
1336 S2 2013-01-01 1975-01-01
1336 S2 2014-01-01 1977-01-01

E.g., for a plant and generator like this, where the dates for all but the last row are all within 12 months of one another but the cumulative time difference between them is more than 12 months, do we want to treat them all as having a date of 1973-04-01? Programatically it's not impossible but it becomes more complex to explain what's going on under the hood. If we want to revisit this decision we should look to #3340.

The bigger question Greg raises is whether to change our dropping behavior or our harvesting behavior for these columns - even if these dates disagreed by more than a year, perhaps there's still value in keeping a generator_operating_date and flagging that it isn't harvested. That seems like a bigger design question.

zaneselvans commented 3 months ago

@e-belfer Rather than using the calendar year, would it be easy / make sense to groupby the generator ID, calculate the difference between the min and max values of operating date, and if it's less than 1-year take the most recently reported value? How many of the stragglers would be left? It sounded like there were some genuinely nonsensical cases where the various operating dates are separated by decades too.

gschivley commented 3 months ago

Adding my 2 cents here. Some capacity expansion models retire a plant's capacity it reaches an "end of lifetime" age. For models that do this, it's more important to have an approximately correct operating date than choose between a correct or missing date.

gschivley commented 1 month ago

@zaneselvans I see that this was at least partially addressed in the v2024.5.0 release. I'm going to take a look but do you know how much generator_operating_date coverage has improved?

gschivley commented 1 month ago

I think there are fewer generators with NULL operating dates in the newest version but it's still a non-trivial number.

Is it possible to report all reported operating dates rather than NULL? I know it's potentially messy but NULL is unhelpful (unless no date has ever been reported). I end up merging it in from the 861 excel file. It would be great to have everything within PUDL instead.

zaneselvans commented 1 month ago

To include ALL of the dates that were ever reported, structurally we would have to move the dates from the entity table to the slowly changing dimension table, which could be done, but also semantically it seems... wrong? Like generators really should have one operating and one retirement date. IIRC when I investigated the distribution of ambiguous dates, picking the last (or first, etc) date within a rolling 2-year window rather than using calendar year filled in nearly all of the missing (because ambiguous) values. And then there were just a handful with truly wild discrepancies of up to multiple decades.

For the small number of generators with ambiguous / multiply reported values, how would you go about choosing one as the canonical value?

gschivley commented 1 month ago

Is there a place in the pipeline where I can see the discrepancies? Could the large discrepancies have anything to do with diesel gensets that get moved from one location to another?

From the user perspective, it feels like out_eia__yearly_generators should include the information from 860 in each reporting year. I understand that it's a mix of static attributes, calculated data, etc. But it would make sense to see the reported operating date in each row.

Is there a strict need to choose the canonical value? Maybe it's a matter of taking a normative vs empirical perspective on the data. There should be a single date that every generator came online but in practice all we have is the reported values.

e-belfer commented 1 week ago

@gschivley The notebook I used to originally view and debug these differences is linked in this comment: https://github.com/catalyst-cooperative/pudl/issues/3340#issuecomment-1936140757. I can pickle a dataframe of the most dramatic differences and share it here tomorrow morning.

FYI - I'm picking back up on this issue since it's been a consistent thorn for multiple people, so expect further updates from me in the coming weeks.

e-belfer commented 1 week ago

Certainly agree about the 365 day window instead of the calendar year, that seems like an obvious improvement. That still leaves some edge cases (<10 generators).

Here are the edge cases I'm seeing, where the dates disagree by more than a year. Not many in number, and a few are just over the year window, but there are some much larger discrepancies here. Also downloadable here as a pickle.

plant_id_eia generator_id generator_operating_date report_date operational_status
1391 4A 1982-09-01 2001-01-01 existing
1391 4A 1987-08-01 2011-01-01 existing
1403 6(4) 1971-05-01 2011-01-01 existing
1403 6(4) 1992-03-01 2001-01-01 existing
1404 7A 1973-04-01 2011-01-01 existing
1404 7A 1974-09-01 2001-01-01 existing
1404 7B 1973-04-01 2011-01-01 existing
1404 7B 1974-09-01 2001-01-01 existing
2115 IC6 1963-03-01 2010-01-01 existing
2115 IC6 1965-09-01 2015-01-01 existing
2869 1A 1973-06-01 2014-01-01 existing
2869 1A 1983-06-01 2001-01-01 existing
3604 6A 1957-06-01 2013-01-01 existing
3604 6A 1997-06-01 2001-01-01 existing
58265 S-13 2004-06-01 2016-01-01 existing
58265 S-13 2006-06-01 2012-01-01 existing
58265 S-17 2004-06-01 2012-01-01 existing
58265 S-17 2006-11-01 2016-01-01 existing
zaneselvans commented 1 week ago

A couple of those 10+ year discrepancies look like maybe they could be typos or OCR errors, with just a single digit difference in the date. 1957 vs 1997? 1973 vs 1983?

e-belfer commented 1 week ago

Decision moving forward is to prioritize filling these nulls:

If we can't resolve the generator operating dates through manual research, we can resolve these 5 by taking, say the most recent date with >40% consistency.

@gschivley Let me know how this sounds to you or if you have any feedback.

gschivley commented 1 week ago

If we focus on the larger discrepancies, here's what I've come up with so far:

plant_id_eia recent report is older/newer notes
1391 Newer (1982 -> 1985) No idea
1403 Older (1992 -> 1971) Maybe it was originally generator_id 4 then repowered in 1992 and renamed 6(4)
2869 Older (1983 -> 1973) I think this was a data error. Unit 1B has always been listed as operating in 1973. I found a plant contact and emailed them. The out_ferc1__yearly_steam_plants_sched402 also shows a plant construction/operation year of 1973.
3604 Older (1997 -> 1957) There was a repowering in 1997

It looks like all the major changes happen between 2011-2014. Maybe EIA put out guidance to report original online date rather than repower/modification date? Ironically, if the newer operating date is a repower it is probably a better value for my users.

gschivley commented 1 week ago

Just heard back from West Lorain (2869). Both 1A and 1B came online in 1973, no idea why it was reported wrong at first. Maybe OCR like @zaneselvans suggested. Or a typo.

gschivley commented 6 days ago

If we can't resolve the generator operating dates through manual research, we can resolve these 5 by taking, say the most recent date with >40% consistency.

@e-belfer are you saying take the most recent date so long as it accounts for >40% of reported values? Or the most recently reported date? I'd lean towards saying recently reported dates are more likely to be "correct" (although the 1992 repowering of a turbine originally installed in 1971 is information I wish EIA was reporting).

e-belfer commented 5 days ago

@gschivley Thanks for your investigative work here, it's been very helpful!

It seems like the manual cases we have confirmation on, we wind up just wanting to take the most recently reported date. So rather than ending up with 10 different rules, let's simplify here and just take the last reported generator operating date when no consistent value can be found.

zaneselvans commented 4 days ago

@gschivley & @grgmiller this change got merged into main last night and the nightly builds passed, so you should be able to grab a new copy of the DB and find entirely non-null generator operating dates.