catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

KeyError during ETL for 2015 and earlier data #508

Closed grgmiller closed 4 years ago

grgmiller commented 4 years ago

Describe the bug

When I run the ETL for data prior to 2016 (I got the same error for both 2014 and 2015), I get the following error:

2020-01-16 15:57:05 [    INFO] pudl.transform.eia923:571 Transforming raw EIA 923 DataFrames for fuel_receipts_costs_eia923 concatenated across all years.
Traceback (most recent call last):
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'natural_gas_delivery_contract_type_code'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\gmiller7\anaconda3\envs\pudl\Scripts\pudl_etl-script.py", line 9, in <module>
    sys.exit(main())
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\cli.py", line 94, in main
    pudl.etl.generate_data_packages(
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\etl.py", line 790, in generate_data_packages
    pkg_tables = etl_pkg(pkg_settings, pudl_settings, pkg_bundle_dir)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\etl.py", line 707, in etl_pkg
    tbls = _etl_eia_pkg(
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\etl.py", line 187, in _etl_eia_pkg
    pudl.transform.eia923.transform(eia923_raw_dfs,
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\transform\eia923.py", line 574, in transform
    eia923_transform_functions[table](eia923_raw_dfs,
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\transform\eia923.py", line 492, in fuel_receipts_costs
    frc_df.pipe(_coalmine_cleanup).
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pandas\core\generic.py", line 5028, in pipe
    return com._pipe(self, func, *args, **kwargs)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pandas\core\common.py", line 483, in _pipe
    return func(obj, *args, **kwargs)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pudl\helpers.py", line 359, in cleanstrings
    out_df[col], str_map, unmapped=unmapped, simplify=simplify)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pandas\core\frame.py", line 2995, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\gmiller7\anaconda3\envs\pudl\lib\site-packages\pandas\core\indexes\base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'natural_gas_delivery_contract_type_code'

Bug Severity

How badly is this bug affecting you?

To Reproduce

Steps to reproduce the behavior -- ideally including a code snippet that causes the error to appear.

In anaconda prompt, I ran: (pudl) C:\Users\gmiller7\Box\PUDL>pudl_etl settings/2014_data.yml

The YAML file contains the following:

# This file controls the PUDL ETL process, and is used as input to pudl_etl

pkg_bundle_name: 2014_data
pkg_bundle_settings:
  - name: epacems
    title: EPA Continuous Emissions Monitoring System Hourly
    description: Hourly emissions, power output, heat rates, and other data for most US fossil fuel plants.
    datasets:
      - eia:
          eia923_tables:
            - generation_fuel_eia923
            - boiler_fuel_eia923
            - generation_eia923
            - coalmine_eia923
            - fuel_receipts_costs_eia923
          eia923_years: [2014]
          eia860_tables:
            - boiler_generator_assn_eia860
            - utilities_eia860
            - plants_eia860
            - generators_eia860
            - ownership_eia860
          eia860_years: [2011,2012,2013,2014,2015,2016,2017]
      - epacems:
          epacems_years: [2014]
          epacems_states: [ALL]
          partition:
            hourly_emissions_epacems: epacems_years

Expected behavior

I expected the ETL process to finish without error, like it did for the 2016 and 2017 data

Software Environment?

Additional context

I am running the ETL for an entire year at at time, for the entire US

zaneselvans commented 4 years ago

The system is really meant to run with the same years for 923 ad 860 -- we really haven't done any testing with different years in those two datasets (except insofar as the 2009-2010 data for 860 hasn't been integrated yet, and it is pulled in for 923), so I would recommend running the ETL for all of the years of data that you want to have access to all at once -- across 860, 923, and CEMS.

If you want to load all the years/states that are available within CEMS (or a big chunk of them anyway) I would recommend creating two data packages in the bundle -- one with 923 and 860, and another with 860, 923, and CEMS. Then you can load the EIA data into the SQLite DB, and convert the CEMS data to Apache Parquet files from the datapackge (with the epacems_to_parquet script) since loading the entire CEMS dataset into SQLite takes an unknown amount of time that is longer than 30 hours (on my laptop at least). You can slurp the Parquet files into Pandas or Dask dataframes directly, and (since they were generated in the same ETL run as the EIA data) should be able to smoothly merge them with the EIA data as needed in you analysis / calculations.

grgmiller commented 4 years ago

I was including all years for EIA 860 due to the workaround in https://github.com/catalyst-cooperative/pudl/issues/467

However I can try re running the ETL with eia860_years: [2011,2012,2013,2014] or just eia860_years: [2014] and update if it works

zaneselvans commented 4 years ago

It looks like from the settings file you included up there that only the 2014 EIA 923 data was included though. They're pretty entangled. Once we debug the 2009-2010 EIA 860 data we might just require the same years for both of them all the time. Dunno.