catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Merge sector codes from EIA 923 into plants_860 data #2193

Closed TrentonBush closed 1 year ago

TrentonBush commented 1 year ago

Describe the issue

Many plants are missing sector categories and codes (eg IPP vs utility) despite this information existing in EIA923. This is because PUDL currently includes sector information only from EIA 860, which has different data coverage than EIA 923. This issue is most severe prior to 2009 (see table below).

For example, in 2009, the plant data in EIA860 includes only 6816 plants, whereas EIA923 covers 9907. Furthermore, the EIA860 sector codes include 203 missing values compared to 0 missing in EIA923. Prior to 2009, EIA 860 does not report sector categories at all.

report_date n_null_sector total_plants frac_null
2001-01-01 5886 5886 1.0
2002-01-01 5832 5832 1.0
2003-01-01 5843 5843 1.0
2004-01-01 5911 5911 1.0
2005-01-01 6029 6029 1.0
2006-01-01 6193 6193 1.0
2007-01-01 6429 6429 1.0
2008-01-01 6622 6622 1.0
2009-01-01 223 6836 0.0326
2010-01-01 69 7280 0.0094
2011-01-01 29 6880 0.0042
2012-01-01 30 7314 0.0041
2013-01-01 28 8088 0.0034
2014-01-01 30 8550 0.0035
2015-01-01 31 8959 0.0034
2016-01-01 35 9746 0.0035
2017-01-01 34 10161 0.0033
2018-01-01 34 11014 0.0030
2019-01-01 33 11866 0.0027
2020-01-01 34 12670 0.0026
2021-01-01 33 13488 0.0024

Impact

I noticed this while comparing fuel cost records from the fuel_receipts_costs table to aggregate values from the EIA bulk electricity data. The bulk data is aggregated by sector (among other things). The missing sector codes mean that I cannot use PUDL to reproduce the bulk aggregates.

Even in 2011 and beyond, when only about 34 plants (0.3%) are missing sector codes, those plants are disproportionately active in the fuel receipts records and produce 9869 entries (2.3%) with null sector codes.

To Reproduce

The SQL query to reproduce the table above is:

SELECT
    report_date,
    SUM(sector_id_eia is NULL) as n_null_sector,
    COUNT(*) as total_plants,
    CAST(SUM(sector_id_eia is NULL) as FLOAT) / count(*) as frac_null
FROM plants_eia860
WHERE report_date < date('2022-01-01') -- exclude 860m data
GROUP BY 1
ORDER BY 1 asc

PUDL Version

This issue is present on the dev branch as of today (last commit hash: db5ce8f2b5ee062708d4700fe6a6391c23a1005f)

zaneselvans commented 1 year ago

My guess is that this is due to / related to #509 -- that Once Upon A Time we thought the sector code would come authoritatively from the EIA 860 and so preemptively dropped it from the EIA 923 data, rather than feeding it into the harvesting process. @knordback has been poking at this issue in #2169

I would guess that just by retaining the sector codes in all the input tables from 923 we would end up with much better coverage coming out of the harvesting step.

knordback commented 1 year ago

I can focus on this one if desired, as it's more targeted and sounds higher priority than other parts of #509

zaneselvans commented 1 year ago

I think this would be a great targeted example. IIRC the EIA sector codes show up in almost every EIA-923 table that references plants.