Closed: davidmudrauskas closed this issue 4 months ago
I've been able to use the `Total of All Companies` rows to validate the rest of the data, and our understanding of it, at least for the one area I ran it for (Louisiana). I'm including sample code below just for illustration, but I can roll this into a notebook or another validation mechanism. This doesn't scale well, so it's worth considering how to run it sparingly, or a more efficient framework like Polars or DuckDB.
```python
from math import isnan

def validate_totals(df):
    report_years = df['report_year'].unique()
    # Each (line, atype) pair identifies one reported variable.
    line_atypes = df.groupby(['line', 'atype']).count().index
    for report_year in report_years:
        for line, atype in line_atypes:
            prefiltered_df = df[
                (df['report_year'] == report_year) &
                (df['area'] == 'Louisiana') &
                (df['line'] == line) &
                (df['atype'] == atype)
            ]
            # Note the leading space in the company name, as reported in the data.
            reported_row = prefiltered_df[
                prefiltered_df['company'] == ' Total of All Companies'
            ]
            calculated_total = prefiltered_df[
                prefiltered_df['company'] != ' Total of All Companies'
            ]['value'].sum()
            if reported_row.shape[0] == 0:
                print("WARNING: Company values not aggregated")
                reported_total = calculated_total
            else:
                assert reported_row.shape[0] == 1
                reported_total = reported_row['value'].values[0]
            print(f"Reported total for year {report_year}, line {line}, atype {atype}: {reported_total}")
            if reported_total != calculated_total:
                if (line, atype) == ('3014', 'CT'):
                    # This is not an additive metric ("Alternative Fuel Fleet?(1=Yes,0=No)")
                    pass
                else:
                    print(f"Calculated total for year {report_year}, line {line}, atype {atype}: {calculated_total}")
                    assert reported_total == calculated_total or (isnan(reported_total) and calculated_total == 0)
```
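For reference, the nested loops above can be collapsed into a single vectorized pass, which should help with the scaling concern. This is just a sketch under the same assumptions as the snippet (same column names, Louisiana hard-coded, `(line, atype) == ('3014', 'CT')` excluded as non-additive):

```python
import pandas as pd

def validate_totals_vectorized(df, area="Louisiana"):
    """Compare each reported 'Total of All Companies' value against the sum
    of the per-company rows, in one pass instead of nested loops."""
    sub = df[df["area"] == area]
    # Exclude the known non-additive flag metric (line 3014, atype CT).
    sub = sub[~((sub["line"] == "3014") & (sub["atype"] == "CT"))]
    is_total = sub["company"] == " Total of All Companies"
    keys = ["report_year", "line", "atype"]
    # Sum of individual company values per (year, line, atype).
    calculated = sub[~is_total].groupby(keys)["value"].sum()
    # The reported aggregate row per group (assumes at most one per group).
    reported = sub[is_total].set_index(keys)["value"]
    comparison = pd.concat(
        {"reported": reported, "calculated": calculated}, axis=1
    )
    # Mismatches, tolerating a missing/NaN reported total when the sum is 0.
    return comparison[
        (comparison["reported"] != comparison["calculated"])
        & ~(comparison["reported"].isna() & (comparison["calculated"] == 0))
    ]
```

Returns a frame of only the failing groups, which is easier to eyeball than a stream of prints.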
We already depend on DuckDB for the Splink record linkage modules, so if it has to be that or Polars, maybe DuckDB is preferable?
Thanks for looking into this, @davidmudrauskas! Sounds like item is shockingly consistent over time, and the correct variable to use here. Can I go ahead and close this issue?
Yep, sounds good. I also moved to the next task of transposing the entity-attribute-value rows into one row per entity and one column per variable. I can push that progress today.
Sounds great! Happy to take a look this week.
The way variables are identified looks remarkably consistent. Here's a breakdown for all years in the dataset (1997-2022). This maps quite clearly to the current Form EIA-176. There are aggregated items (e.g., line `101T`, the `Total of All Companies` values, etc.), so I'll factor those out and see where I can use them expediently to validate things. Then I'll transpose the remaining values into a wide table of company report data in the next task.
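The transposition step described above could be sketched with a pandas pivot. This is only an illustration under assumed column names (the same ones used in the validation snippet), with the variable columns flattened to hypothetical `line_atype` names:

```python
import pandas as pd

def to_wide(df):
    """Pivot entity-attribute-value rows into one row per (year, company)
    and one column per (line, atype) variable."""
    wide = df.pivot_table(
        index=["report_year", "company"],  # one row per entity
        columns=["line", "atype"],         # one column per variable
        values="value",
        aggfunc="first",                   # assumes one value per cell
    )
    # Flatten the (line, atype) column MultiIndex into e.g. "010_VL".
    wide.columns = [f"{line}_{atype}" for line, atype in wide.columns]
    return wide.reset_index()
```

`aggfunc="first"` makes the assumption that each company reports each variable at most once per year explicit; if duplicates turn up, switching to `"sum"` or raising would surface them.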