catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars, 106 forks

Running pudl_etl gives KeyError: 'values' #1056

Closed: kevinsung closed this issue 2 years ago

kevinsung commented 2 years ago

Describe the bug

Running the pudl_etl script for the fast ETL results in a KeyError.

Bug Severity

How badly is this bug affecting you?

To Reproduce

  1. Follow the development setup instructions to set up a development environment.
  2. In the PUDL working directory, run the first two steps of the fast ETL with the default settings (following the documentation):

    $ ferc1_to_sqlite settings/etl_fast.yml
    $ pudl_etl settings/etl_fast.yml

    Result

    After a bunch of logging output, an error is thrown (only the last few lines of logging are shown here):

    ...
    2021-07-05 14:07:22 [    INFO] pudl.load.csv:46 Loading EPA CEMS hourly_emissions_epacems_2019_id dataframe into CSV
    2021-07-05 14:07:22 [    INFO] pudl.etl:459 Loading EPA CEMS took 00:00:12
    2021-07-05 14:07:28 [    INFO] pudl.load.csv:46 Loading Glue plants_pudl dataframe into CSV
    2021-07-05 14:07:28 [    INFO] pudl.load.csv:46 Loading Glue utilities_pudl dataframe into CSV
    2021-07-05 14:07:28 [    INFO] pudl.load.csv:46 Loading Glue plants_eia dataframe into CSV
    2021-07-05 14:07:28 [    INFO] pudl.load.csv:46 Loading Glue utilities_eia dataframe into CSV
    2021-07-05 14:07:28 [    INFO] pudl.load.csv:46 Loading Glue utility_plant_assn dataframe into CSV
    2021-07-05 14:07:29 [    INFO] pudl.load.metadata:528 Validating JSON descriptor for epacems-eia tabular data package...
    2021-07-05 14:07:29 [    INFO] pudl.load.metadata:535 JSON descriptor appears valid!
    2021-07-05 14:07:29 [    INFO] pudl.load.metadata:540 Validating epacems-eia tabular data package using goodtables_pandas...
    /opt/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/goodtables_pandas/validate.py:127: FutureWarning: The error_bad_lines argument has been deprecated and will be removed in a future version.
      result = read_table(resource, path=paths)
    Traceback (most recent call last):
      File "/opt/miniconda3/envs/pudl-dev/bin/pudl_etl", line 33, in <module>
        sys.exit(load_entry_point('catalystcoop.pudl', 'console_scripts', 'pudl_etl')())
      File "/home/kjs/projects/pudl/src/pudl/cli.py", line 115, in main
        _ = pudl.etl.generate_datapkg_bundle(
      File "/home/kjs/projects/pudl/src/pudl/etl.py", line 899, in generate_datapkg_bundle
        descriptor = pudl.load.metadata.generate_metadata(
      File "/home/kjs/projects/pudl/src/pudl/load/metadata.py", line 694, in generate_metadata
        _ = validate_save_datapkg(datapkg_descriptor, datapkg_dir)
      File "/home/kjs/projects/pudl/src/pudl/load/metadata.py", line 557, in validate_save_datapkg
        new_err["values"] = new_err["values"][:5]
    KeyError: 'values'

Expected behavior

The command should finish with no error.
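
For illustration only: the last frame of the traceback indexes `new_err["values"]` unconditionally, so any validation error that lacks a `values` key raises the `KeyError`. Below is a minimal sketch of a tolerant version of that truncation step. The helper name `truncate_error_values` and the shape of the sample error dicts are hypothetical; they are not taken from the PUDL or goodtables_pandas source.

```python
from typing import Any, Dict


def truncate_error_values(err: Dict[str, Any], limit: int = 5) -> Dict[str, Any]:
    """Return a copy of an error dict with any 'values' list shortened.

    Hypothetical helper: errors without a 'values' key pass through
    unchanged instead of raising KeyError.
    """
    new_err = dict(err)  # shallow copy so the caller's dict is not mutated
    if "values" in new_err:
        new_err["values"] = new_err["values"][:limit]
    return new_err


# Made-up sample errors, one with and one without a 'values' key.
errors = [
    {"code": "hypothetical-with-values", "values": list(range(20))},
    {"code": "hypothetical-without-values", "message": "no values key here"},
]
short = [truncate_error_values(e) for e in errors]
```

This only avoids the crash while reporting; whatever validation error was being reported would still need to be looked at separately.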

Software Environment?

Additional context

The settings YAML file (it's just the default fast ETL settings):

###########################################################################
# FERC FORM 1 DB CLONE SETTINGS
###########################################################################
# if you are loading ferc1, you need to specify a reference year. This is the
# year whose database structure is used as a template.
ferc1_to_sqlite_refyear: 2019
# What years of original FERC data should be cloned into the SQLite DB?
ferc1_to_sqlite_years: [2019]
# A list of tables to be loaded into the local SQLite database. These are
# the table names as they appear in the 2015 FERC Form 1 database.
ferc1_to_sqlite_tables:
  - f1_respondent_id
  - f1_gnrt_plant
  - f1_steam
  - f1_fuel
  - f1_plant_in_srvce
  - f1_hydro
  - f1_pumped_storage
  - f1_purchased_pwr

datapkg_bundle_name: pudl-fast
datapkg_bundle_doi: 10.5072/zenodo.123456 # Sandbox DOI... not real.
datapkg_bundle_settings:
  ###########################################################################
  # FERC FORM 1 SETTINGS
  ###########################################################################
  - name: ferc1
    title: FERC Form 1
    description: A single year of FERC Form 1 data, with all default tables.
    version: 0.1.0
    datasets:
      - ferc1:
          ferc1_tables:
           - fuel_ferc1 # fuel_ferc1 requires plants_steam_ferc1 to load
           - plants_steam_ferc1
           - plants_small_ferc1
           - plants_hydro_ferc1
           - plants_pumped_storage_ferc1
           - plant_in_service_ferc1
           - purchased_power_ferc1
          ferc1_years: [2019]

  ###########################################################################
  # EPA CEMS AND EIA 860/923 SETTINGS
  ###########################################################################
  # EPA CEMS depends on the EIA data. Rather than running the ETL on EIA
  # twice, we assume its inclusion in this datapackage is sufficient for a
  # quick test run.
  - name: epacems-eia
    title: EPA CEMS Hourly Emissions and EIA 860/923
    description: A minimal EPA CEMS ETL run, including one year of Idaho data.
    version: 0.1.0
    datasets:
      - eia:
          # This is the full list of EIA 923 tables.  Many of them are
          # interdependent, and are used in the definition of the overall
          # database, so it is recommended that you import either all of them
          # or none of them. Additionally, there are many relationships between
          # the EIA 923 and EIA 860 tables, and in general they should be
          # imported together.
          eia923_tables:
            - generation_fuel_eia923
            - boiler_fuel_eia923
            - generation_eia923
            - coalmine_eia923 # REQUIRES fuel_receipts_costs_eia923
            - fuel_receipts_costs_eia923
          eia923_years: [2019]
          # See notes above about the entanglement between EIA 923 and EIA 860.
          # It's best to load all the tables from both of them, or neither of
          # them.
          eia860_tables:
            - boiler_generator_assn_eia860
            - utilities_eia860
            - plants_eia860
            - generators_eia860
            - ownership_eia860
          eia860_years: [2019]
          eia860_ytd: True
      - epacems:
          # Note that the CEMS data relies on EIA 860 data for plant locations,
          # so if you're loading CEMS data for a particular year, you should
          # also load the EIA 860 data for that year (2011-2019 only)
          epacems_years: [2019]
          # Just Idaho, because it is tiny:
          epacems_states: [ID]

zaneselvans commented 2 years ago

Hey @kevinsung, when you say you installed pudl with conda, do you mean the (sadly very out of date) v0.3.2 that's available from conda-forge? If so, that's probably why things aren't working correctly. You'll need to clone the repo and install from there instead, especially for the development environment.

kevinsung commented 2 years ago

@zaneselvans I indeed cloned the PUDL repo and installed from there. I simply followed the instructions for "Development Setup" in the PUDL documentation.

kevinsung commented 2 years ago

conda list says my version of catalystcoop-pudl is 0.3.3.dev1039+g2a0fe38f
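
As a rough aid for reading that version string: assuming it follows the setuptools_scm-style convention (an assumption, not something confirmed in this thread), `0.3.3.dev1039+g2a0fe38f` would mean a development build some number of commits past the last tagged release, built from the git commit after the `g`. The sketch below just splits the string into those parts; `parse_dev_version` is a hypothetical helper, not part of PUDL.

```python
import re
from typing import Dict, Optional

# Matches strings like "0.3.3.dev1039+g2a0fe38f":
# release number, ".dev" distance, then "+g" followed by a git hash.
_DEV_VERSION = re.compile(
    r"^(?P<release>\d+(?:\.\d+)*)\.dev(?P<distance>\d+)\+g(?P<commit>[0-9a-f]+)"
)


def parse_dev_version(version: str) -> Optional[Dict[str, str]]:
    """Split a dev-style version string into parts, or return None
    if the string does not look like a development build."""
    match = _DEV_VERSION.match(version)
    return match.groupdict() if match else None


info = parse_dev_version("0.3.3.dev1039+g2a0fe38f")
release_only = parse_dev_version("0.3.2")  # a plain release, no dev suffix
```

Under that assumption, a string with a `.dev` part would indicate an install from a repository checkout rather than a packaged release such as the conda-forge v0.3.2.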

kevinsung commented 2 years ago

Oh looks like there are newer versions; maybe I'm using the wrong branch (main)?