Closed: davidmudrauskas closed this issue 4 months ago
I've been able to use the `Total of All Companies` rows to validate the rest of the data, and our understanding of it, at least for the one area I ran it for (Louisiana). I'm including sample code below just for illustration, but I can roll this into a notebook or another validation mechanism. This doesn't scale well, so it's worth considering how to run it sparingly, or a more efficient framework like Polars or DuckDB.
```python
from math import isnan

def validate_totals(df):
    report_years = df['report_year'].unique()
    # Each (line, atype) pair identifies one reported variable.
    line_atypes = df.groupby(['line', 'atype']).count().index
    for report_year in report_years:
        for line, atype in line_atypes:
            prefiltered_df = df[
                (df['report_year'] == report_year) &
                (df['area'] == 'Louisiana') &
                (df['line'] == line) &
                (df['atype'] == atype)
            ]
            # Note the leading space in the company name, as reported in the data.
            reported_row = prefiltered_df[
                prefiltered_df['company'] == ' Total of All Companies'
            ]
            calculated_total = prefiltered_df[
                prefiltered_df['company'] != ' Total of All Companies'
            ]['value'].sum()
            if reported_row.shape[0] == 0:
                print("WARNING: Company values not aggregated")
                reported_total = calculated_total
            else:
                assert reported_row.shape[0] == 1
                reported_total = reported_row['value'].values[0]
            print(f"Reported total for year {report_year}, line {line}, atype {atype}: {reported_total}")
            if reported_total != calculated_total:
                if (line, atype) == ('3014', 'CT'):
                    # This is not an additive metric ("Alternative Fuel Fleet?(1=Yes,0=No)")
                    pass
                else:
                    print(f"Calculated total for year {report_year}, line {line}, atype {atype}: {calculated_total}")
                    assert reported_total == calculated_total or (isnan(reported_total) and calculated_total == 0)
```
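For reference, the nested loops above can be collapsed into a single vectorized pass, which should help with the scaling concern. This is just a sketch under the same assumptions as the snippet (same column names, Louisiana hard-coded, `(line, atype) == ('3014', 'CT')` excluded as non-additive):

```python
import pandas as pd

def validate_totals_vectorized(df, area="Louisiana"):
    """Compare each reported 'Total of All Companies' value against the sum
    of the per-company rows, in one pass instead of nested loops."""
    sub = df[df["area"] == area]
    # Exclude the known non-additive flag metric (line 3014, atype CT).
    sub = sub[~((sub["line"] == "3014") & (sub["atype"] == "CT"))]
    is_total = sub["company"] == " Total of All Companies"
    keys = ["report_year", "line", "atype"]
    # Sum of individual company values per (year, line, atype).
    calculated = sub[~is_total].groupby(keys)["value"].sum()
    # The reported aggregate row per group (assumes at most one per group).
    reported = sub[is_total].set_index(keys)["value"]
    comparison = pd.concat(
        {"reported": reported, "calculated": calculated}, axis=1
    )
    # Mismatches, tolerating a missing/NaN reported total when the sum is 0.
    return comparison[
        (comparison["reported"] != comparison["calculated"])
        & ~(comparison["reported"].isna() & (comparison["calculated"] == 0))
    ]
```

Returns a frame of only the failing groups, which is easier to eyeball than a stream of prints.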
We already depend on DuckDB for the Splink record linkage modules, so if it has to be that or Polars, maybe DuckDB is preferable?
Thanks for looking into this, @davidmudrauskas! Sounds like item is shockingly consistent over time, and the correct variable to use here. Can I go ahead and close this issue?
Yep, sounds good. I also moved to the next task of transposing the entity-attribute-value rows into one row per entity and one column per variable. I can push that progress today.
Sounds great! Happy to take a look this week.
The way variables are identified looks remarkably consistent. Here's a breakdown for all years in the dataset (1997-2022). This maps quite clearly to the current Form EIA-176. There are aggregated items (e.g., line `101T`, the `Total of All Companies` values, etc.), so I'll factor those out and see where I can use them expediently to validate things. Then I'll transpose the remaining values into a wide table of company report data in the next task.
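The transposition step described above could be sketched with a pandas pivot. This is only an illustration under assumed column names (the same ones used in the validation snippet), with the variable columns flattened to hypothetical `line_atype` names:

```python
import pandas as pd

def to_wide(df):
    """Pivot entity-attribute-value rows into one row per (year, company)
    and one column per (line, atype) variable."""
    wide = df.pivot_table(
        index=["report_year", "company"],  # one row per entity
        columns=["line", "atype"],         # one column per variable
        values="value",
        aggfunc="first",                   # assumes one value per cell
    )
    # Flatten the (line, atype) column MultiIndex into e.g. "010_VL".
    wide.columns = [f"{line}_{atype}" for line, atype in wide.columns]
    return wide.reset_index()
```

`aggfunc="first"` makes the assumption that each company reports each variable at most once per year explicit; if duplicates turn up, switching to `"sum"` or raising would surface them.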