catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

init_pudl.py fails on EIA form 923 boiler_fuel #143

Closed by karldw 6 years ago

karldw commented 6 years ago

Running init_pudl.py (with the default arguments) fails when the code tries to ingest the EIA 923 boiler data for 2009. Do you have any tips?


Reading EIA 923 spreadsheet data
    2009...
    2010...
    2011...
    2012...
    2013...
    2014...
    2015...
    2016...
Converting EIA 923 generation_fuel to DataFrame...
Converting EIA 923 stocks to DataFrame...
Converting EIA 923 boiler_fuel to DataFrame...
Traceback (most recent call last):
  File "init_pudl.py", line 105, in <module>
    sys.exit(main())
  File "init_pudl.py", line 101, in main
    keep_csv=args.keep_csv)
  File "pudl/pudl/init.py", line 1254, in init_db
    csvdir=csvdir, keep_csv=keep_csv)
  File "pudl/pudl/init.py", line 1115, in ingest_eia923
    verbose=verbose)
  File "pudl/pudl/extract/eia923.py", line 163, in get_eia923_page
    to_drop = [c for c in newdata.columns if c[:8] == 'reserved']
  File "pudl/pudl/extract/eia923.py", line 163, in <listcomp>
    to_drop = [c for c in newdata.columns if c[:8] == 'reserved']
TypeError: 'float' object is not subscriptable

When I print out newdata.columns, it looks like it's missing headers:

Index([               nan,                'n',    'greene_county',
       'alabama_power_co',                nan,               'al',
                    'esc',             'serc',                nan,
                      nan, 'electric_utility',              'dfo',
                  'dfo_1',          'barrels',                nan,
                      nan,         '102719_1',         '102719_2',
               '102719_3',         '102719_4',         '102719_5',
               '102719_6',         '102719_7',         '102719_8',
               '102719_9',        '102719_10',                nan],
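
For anyone hitting the same traceback: the `nan` labels in that Index are floats, and slicing a float with `c[:8]` is what raises the `TypeError`. Below is a minimal sketch that reproduces it on a made-up frame. `newdata` here is a hypothetical stand-in, the `reserved_1` label is invented for illustration, and `reserved_columns` is a helper name that is not part of PUDL. The guard only hides the symptom; the NaN labels mean a data row was read as the header, so the real fix is upstream in how the page is located.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the mis-parsed page: a data row was taken as the
# header, so some column labels are NaN (a float) instead of strings.
newdata = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=[np.nan, "n", "greene_county", "reserved_1"],
)


def reserved_columns(df):
    """Return labels starting with 'reserved', skipping non-string labels."""
    return [
        c for c in df.columns
        if isinstance(c, str) and c.startswith("reserved")
    ]


# The original comprehension fails as soon as it reaches a NaN label.
try:
    [c for c in newdata.columns if c[:8] == "reserved"]
    message = ""
except TypeError as err:
    message = str(err)

# The guarded version tolerates NaN labels.
to_drop = reserved_columns(newdata)
print(message)
print(to_drop)
```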

zaneselvans commented 6 years ago

Hmm. This does not look familiar. I wonder if they might have changed the file layouts when they moved everything into the archive directory. Grr. @stevenbwinter has mostly worked on the mapping of the spreadsheet rows & columns. We will take a closer look.

zaneselvans commented 6 years ago

@stevenbwinter is looking into this. It appears that EIA has retroactively added a tab containing information about oil stocks to the 923 spreadsheets. That throws off the parsing, since the tabs are read in based on their order in the spreadsheets. If that's the only change, it should be easy to fix.
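
To illustrate why order-based tab lookup is fragile, here is a rough sketch of selecting a tab by keywords in its name instead of by position. This is not necessarily how the PUDL parser was changed. The tab names below are invented for the example, and `find_sheet` is a hypothetical helper.

```python
def find_sheet(sheet_names, keywords):
    """Return the first sheet name containing every keyword (case-insensitive)."""
    for name in sheet_names:
        lowered = name.lower()
        if all(k.lower() in lowered for k in keywords):
            return name
    raise KeyError(f"no sheet matching {keywords!r} in {sheet_names!r}")


# Hypothetical tab names, before and after a new tab is inserted.
before = [
    "Page 1 Generation and Fuel Data",
    "Page 2 Stocks Data",
    "Page 3 Boiler Fuel Data",
]
after = [
    "Page 1 Generation and Fuel Data",
    "Page 2 Stocks Data",
    "Page 2 Oil Stocks Data",
    "Page 3 Boiler Fuel Data",
]

# Positional lookup silently picks a different tab once the new one appears.
by_position_before = before[2]
by_position_after = after[2]

# Name-based lookup is unaffected by the extra tab.
by_name_before = find_sheet(before, ["boiler", "fuel"])
by_name_after = find_sheet(after, ["boiler", "fuel"])
print(by_position_before, "|", by_position_after)
print(by_name_before, "|", by_name_after)
```

With pandas, the list of names could come from `pd.ExcelFile(path).sheet_names`, and the matched name could then be passed as `sheet_name` to `pd.read_excel`.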

zaneselvans commented 6 years ago

Okay, @karldw, both @swinter2011 and I have been able to completely wipe out our datastores and re-initialize the PUDL DB after changing the parser to accommodate the new tabs EIA added to the spreadsheets.

karldw commented 6 years ago

Works for me too!