catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Importing FERC Form 1 for just 2016 fails #232

Closed zaneselvans closed 5 years ago

zaneselvans commented 5 years ago

Describe the bug When initializing the PUDL database, if the only FERC Form 1 year included is 2016, the import fails in pudl.transform.ferc1.fuel() with an infinite recursion. When more than 1 year of data is used to initialize the database, alongside 2016, the process completes successfully.

To Reproduce

import pudl
pudl.init.init_db(
    ferc1_tables=['fuel_ferc1'],
    ferc1_years=[2016],
    eia923_tables=pudl.constants.eia923_pudl_tables,
    eia923_years=[],
    eia860_tables=pudl.constants.eia860_pudl_tables,
    eia860_years=[],
    epacems_years=[],
    epacems_states=pudl.constants.travis_ci_epacems_states,
    verbose=True,
    debug=False,
    pudl_testing=True,
    ferc1_testing=False)

Expected behavior PUDL should be able to import any individual year of data from any of the data sources in isolation.

zaneselvans commented 5 years ago

The problem appears to be coming from the call to cleanstrings that is cleaning up the messy Form 1 fuel types. The bug is only exhibited when the raw input from FERC Form 1 excludes records that have fuel_quantity <= 0 AND when the strings are remapped with all of coal, oil, gas, nuclear, waste and other being part of the simplified taxonomy of fuel types. Excluding any one of those types from the remapping results in the call to cleanstrings succeeding. WTF?

A minimal set of code to reproduce from within a notebook...

import pandas as pd
import sys
import os
sys.path.append(os.path.abspath(os.path.join('..','..','..')))
import pudl
fuel_strings = {
    'coal': pudl.constants.ferc1_coal_strings,
    'oil': pudl.constants.ferc1_oil_strings,
    'gas': pudl.constants.ferc1_gas_strings,
    'nuclear': pudl.constants.ferc1_nuke_strings,
    'waste': pudl.constants.ferc1_waste_strings,
    'other': pudl.constants.ferc1_other_strings
}

def extract_fuel(bork=False):
    fuel_raw = pd.read_sql("SELECT * FROM f1_fuel WHERE report_year=2016", pudl.extract.ferc1.connect_db())
    fuel_raw = fuel_raw[(fuel_raw.fuel != '') & (fuel_raw.plant_name != '')]
    if bork:
        fuel_raw = fuel_raw[fuel_raw.fuel_quantity > 0]
    return fuel_raw

def transform_fuel(fuel_df):
    fuel_df.fuel = pudl.transform.pudl.cleanstrings(fuel_df.fuel, fuel_strings, unmapped='')
    return fuel_df

fuel_raw = extract_fuel(bork=True)
fuel_clean = transform_fuel(fuel_raw)