catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Fix or drop `plant_in_service` rows that don't add up #2066

Open zaneselvans opened 1 year ago

zaneselvans commented 1 year ago

A small number of records in the `plant_in_service_ferc1` table do not sum to their reported `ending_balance`. If we calculate the expected value and compare it to the reported value we find:

```python
import numpy as np

cols_to_sum = [
    "starting_balance",
    "additions",
    "retirements",
    "adjustments",
    "transfers",
]
pis["ending_balance_calc"] = (
    pis.loc[:, cols_to_sum].copy().fillna(0).sum(axis="columns")
)
bad_calc = pis[
    ~np.isclose(pis.ending_balance, pis.ending_balance_calc)
    & (pis.ending_balance.notna())
].copy()
bad_calc.info()
```
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 205 entries, 775 to 59877
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   utility_id_ferc1     205 non-null    Int64   
 1   report_year          205 non-null    Int64   
 2   ferc_account_label   205 non-null    string  
 3   ferc_account         148 non-null    string  
 4   row_type_xbrl        205 non-null    category
 5   starting_balance     203 non-null    float64 
 6   additions            177 non-null    float64 
 7   retirements          135 non-null    float64 
 8   adjustments          47 non-null     float64 
 9   transfers            92 non-null     float64 
 10  ending_balance       205 non-null    float64 
 11  record_id            205 non-null    string  
 12  ending_balance_calc  205 non-null    float64 
dtypes: Int64(2), category(1), float64(7), string(3)
memory usage: 21.5 KB
```
```python
bad_calc.report_year.value_counts()
```
```
1996    96
1997    45
1998    35
1994    16
1995    13
Name: report_year, dtype: Int64
```

The number of bad records is small, but some of them are very bad: off by a factor of 10,000,000x. Errors of that magnitude will seriously distort aggregations across categories, so this needs to be addressed somehow.
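One rough way to surface the worst offenders would be to rank records by the ratio between the reported and calculated balances, taking whichever direction of the ratio is larger so that over- and under-statements rank the same. This is just a sketch: the `bad_calc` frame here is a tiny synthetic stand-in with made-up values, not the real FERC data, and `error_magnitude` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real bad_calc frame (illustrative values only).
bad_calc = pd.DataFrame({
    "record_id": ["a", "b", "c"],
    "ending_balance": [1_000.0, 5_000_000_000.0, 2_000.0],
    "ending_balance_calc": [1_100.0, 500.0, 2_000_000.0],
})

# Ratio of reported to calculated balance, taking the larger of the two
# directions so a 10,000,000x error shows up as 1e7 either way.
ratio = bad_calc.ending_balance / bad_calc.ending_balance_calc
bad_calc["error_magnitude"] = np.maximum(ratio.abs(), (1 / ratio).abs())

worst = bad_calc.sort_values("error_magnitude", ascending=False)
print(worst[["record_id", "error_magnitude"]])
```

Sorting on a magnitude like this would make it easy to eyeball whether the worst records look like unit errors (powers of ten) versus something structural.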

Possible sources of error

e-belfer commented 1 year ago

Poked around a bit. One finding: about 15% of these `bad_calc` records have a balance difference that's exactly equal to the `transfers` column.

```python
pis["balance_diff"] = pis["ending_balance_calc"] - pis["ending_balance"]
cols_to_check = [
    "record_id",
    "report_year",
    "starting_balance",
    "additions",
    "retirements",
    "adjustments",
    "transfers",
    "ending_balance",
    "ending_balance_calc",
    "balance_diff",
]
subset = pis.loc[pis.record_id.isin(bad_calc.record_id), pis.columns.isin(cols_to_check)]
trans = subset.loc[subset.transfers == subset.balance_diff]
trans.info()
```
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 36 entries, 2812 to 50179
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   report_year          36 non-null     Int64  
 1   starting_balance     36 non-null     float64
 2   additions            36 non-null     float64
 3   retirements          33 non-null     float64
 4   adjustments          2 non-null      float64
 5   transfers            36 non-null     float64
 6   ending_balance       36 non-null     float64
 7   record_id            36 non-null     string 
 8   ending_balance_calc  36 non-null     float64
 9   balance_diff         36 non-null     float64
```
zaneselvans commented 1 year ago

Huh, that's interesting. Should we adjust the ending balances to account for the transfers?
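If we went that route, one possible correction (a sketch only, not PUDL's actual transform logic) would be to overwrite the reported `ending_balance` with the calculated value for just the records where the discrepancy is exactly the `transfers` value, leaving everything else alone. The `pis` frame below is a two-row toy example with made-up numbers.

```python
import pandas as pd

# Toy plant-in-service records; the second row's reported ending_balance
# omits the transfer, so calc - reported == transfers exactly.
pis = pd.DataFrame({
    "record_id": ["r1", "r2"],
    "starting_balance": [100.0, 100.0],
    "additions": [10.0, 10.0],
    "retirements": [0.0, 0.0],
    "adjustments": [0.0, 0.0],
    "transfers": [0.0, 25.0],
    "ending_balance": [110.0, 110.0],
})

cols_to_sum = [
    "starting_balance", "additions", "retirements", "adjustments", "transfers",
]
pis["ending_balance_calc"] = pis[cols_to_sum].fillna(0).sum(axis="columns")
balance_diff = pis.ending_balance_calc - pis.ending_balance

# Only touch rows where the discrepancy is exactly the transfers value.
transfer_only = balance_diff.eq(pis.transfers) & balance_diff.ne(0)
pis.loc[transfer_only, "ending_balance"] = pis.loc[transfer_only, "ending_balance_calc"]
```

The `.ne(0)` guard keeps the fix from touching rows that already balance (where both the difference and `transfers` are zero).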

Looking at the dbf_to_xbrl.csv file that tracks which years form changes happened in, it seems unlikely that these errors are due to respondents using the wrong version of the form: the `f1_plant_in_service` table was unchanged from 1994 until 2003.