catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

Address 541 utilities that are missing from 2010 EIA 860 data #1262

Closed zaneselvans closed 3 years ago

zaneselvans commented 3 years ago

Debugging foreign key violations that are being surfaced by our more extensive use of natural primary keys, I found that there are 541 invalid [utility_id_eia, report_date] references from the plants_eia860 table to the utilities_eia860 table, all of which pertain to 2010.

plants_eia860 = pd.read_sql("plants_eia860", pudl_engine)
plants_entity = pd.read_sql("plants_entity_eia", pudl_engine)
utils_eia860 =  pd.read_sql("utilities_eia860", pudl_engine)

utils_idx = (
    utils_eia860[["utility_id_eia", "report_date"]]
    .drop_duplicates()
    .set_index(["utility_id_eia", "report_date"])
    .index
)
pu_idx = (
    plants_eia860[["utility_id_eia", "report_date"]]
    .drop_duplicates()
    .dropna()
    .set_index(["utility_id_eia", "report_date"])
    .index
)
missing_idx = (
    pu_idx.difference(utils_idx)
    .to_frame(index=False)
    .astype({"utility_id_eia": int})
    .set_index(["utility_id_eia", "report_date"])
    .index
)

plants_eia860.set_index(["utility_id_eia", "report_date"]).loc[missing_idx]
# 613 plant records, which are almost entirely filled with Null values.

However, these utility_id_eia values do appear in the utilities_eia860 table in other years, just not 2010.

missing_util_ids = (
    pu_idx
    .difference(utils_idx)
    .to_frame(index=False)
    .astype({"utility_id_eia": int})
    .utility_id_eia
)

(
    utils_eia860[utils_eia860.utility_id_eia.isin(missing_util_ids)]
    .report_date
    .value_counts()
    .sort_index()
)
2001-01-01    457
2002-01-01    487
2003-01-01    490
2004-01-01    508
2005-01-01    521
2006-01-01    533
2007-01-01    539
2008-01-01    541
2009-01-01    541
2010-01-01
2011-01-01      2
2012-01-01      4
2013-01-01      3
2014-01-01      3
2015-01-01      3
2016-01-01      3
2017-01-01      4
2018-01-01      4
2019-01-01      4
2020-01-01    211
Name: report_date, dtype: int64

For the moment I am disabling this foreign key constraint, but we should figure out what's going on with these missing utilities. Is it an error in the original data? There have been other irregularities related to utilities in the 2010 reporting in other tables.

Also, why do they almost totally vanish in 2010, and then suddenly crop up in the 2020 data? Is that just something different with the eia860m data?

zaneselvans commented 3 years ago

@cmgosnell Do you have any idea what might be going on here?

What's the right way to address this problem? It looks like all 541 of the utilities appear in the utilities_eia860 table in 2009. If it's a reporting error we could just... copy them forward. I guess we should go spot check the EIA 860 spreadsheets and make sure the utilities really are missing and it's not just that they've gotten clobbered in the ETL somehow.

cmgosnell commented 3 years ago

I wonder if it is related to this issue #1116. There are a ton of plants in 2010 that show up in the ownership table with their owner as their utility_id_eia as opposed to the appropriate owner_utility_id_eia

zaneselvans commented 3 years ago

It turns out that this is an oversight in the current harvesting process, which is inappropriately stripping the utility_id_eia column from plants_eia860 on the first pass through that table. Adding plants_eia860 to this list of exceptions at the end of pudl.transform.eia._compile_all_entity_records() should fix the problem:

# remove the static columns, with an exception
if ((entity in ('generators', 'plants')) and (table_name in ('ownership_eia860', 'utilities_eia860', 'plants_eia860', 'generators_eia860'))):
    cols.remove('utility_id_eia')