catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Fix FIPS coding in PUDL #3884

Open e-belfer opened 1 week ago

e-belfer commented 1 week ago

Is your feature request related to a problem? Please describe. In #3531 we've learned that the addfips package we rely on to map state and county names to FIPS codes is irregularly maintained. This is causing problems for downstream users, and we should consider dropping this dependency and moving to a different solution.

Describe the solution you'd like There are a few different use cases for FIPS encoding that we could hypothetically imagine for PUDL. Right now, add_fips_ids in pudl.helpers is the only place we import the addfips package. This method expects a state column, a county column, and a FIPS vintage:

# Imports needed to run this snippet on its own:
import logging

import addfips
import pandas as pd

logger = logging.getLogger(__name__)


def add_fips_ids(
    df: pd.DataFrame,
    state_col: str = "state",
    county_col: str = "county",
    vintage: int = 2015,
) -> pd.DataFrame:
    """Add State and County FIPS IDs to a dataframe.

    To just add State FIPS IDs, make county_col = None.
    """
    # force the columns to be the nullable string types so we have a consistent
    # null value to filter out before feeding to addfips
    df = df.astype({state_col: pd.StringDtype()})
    if county_col:
        df = df.astype({county_col: pd.StringDtype()})
    af = addfips.AddFIPS(vintage=vintage)
    # Lookup the state and county FIPS IDs and add them to the dataframe:
    df["state_id_fips"] = df.apply(
        lambda x: (
            af.get_state_fips(state=x[state_col]) if pd.notnull(x[state_col]) else pd.NA
        ),
        axis=1,
    )

    # force the code columns to be nullable strings - the leading zeros are
    # important
    df = df.astype({"state_id_fips": pd.StringDtype()})

    logger.info(
        f"Assigned state FIPS codes for "
        f"{len(df[df.state_id_fips.notnull()])/len(df):.2%} of records."
    )
    if county_col:
        df["county_id_fips"] = df.apply(
            lambda x: (
                af.get_county_fips(state=x[state_col], county=x[county_col])
                if pd.notnull(x[county_col]) and pd.notnull(x[state_col])
                else pd.NA
            ),
            axis=1,
        )
        # force the code columns to be nullable strings - the leading zeros are
        # important
        df = df.astype({"county_id_fips": pd.StringDtype()})
        logger.info(
            f"Assigned county FIPS codes for "
            f"{len(df[df.county_id_fips.notnull()])/len(df):.2%} of records."
        )
    return df
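
For context, a typical call looks something like this (an illustrative snippet, not something taken from our ETL):

# Hypothetical input; in PUDL these columns would come from an EIA table.
plants = pd.DataFrame(
    {
        "state": ["CO", "NM", None],
        "county": ["Boulder", "Bernalillo", None],
    }
)

# Rows with null state/county pass through with pd.NA FIPS codes.
plants_with_fips = add_fips_ids(plants, vintage=2015)
print(plants_with_fips[["state", "county", "state_id_fips", "county_id_fips"]])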

Approaches to resolving the problem

County and state names -> FIPS Code

This is our current FIPS encoding use case; the options we could pursue here are laid out under "Proposed solutions" below.

Address -> lat/lon -> FIPS code, or lat/lon -> FIPS code

As mentioned above, this wouldn't address the use case for two of the three EIA datasets we're currently using the addfips package for. See the pygris package, which seems interesting if we want to explore this further, but it is out of scope for the actual problem we're trying to resolve. See also #3531 for some discussion of other ways to implement this.
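
For reference, a minimal sketch of what the lat/lon -> FIPS path could look like with pygris and geopandas (the counties() call, its year argument, and the GEOID column name are assumptions here, not something we've vetted):

import geopandas as gpd
import pandas as pd
from pygris import counties


def add_fips_from_latlon(
    df: pd.DataFrame, lat_col: str = "latitude", lon_col: str = "longitude"
) -> pd.DataFrame:
    """Assign FIPS codes by spatially joining points to Census county geometries."""
    # Cartographic-boundary county polygons for a single vintage (year is illustrative).
    county_gdf = counties(cb=True, year=2023)[["GEOID", "geometry"]]
    points = gpd.GeoDataFrame(
        df.copy(),
        geometry=gpd.points_from_xy(df[lon_col], df[lat_col]),
        crs="EPSG:4326",
    ).to_crs(county_gdf.crs)
    joined = gpd.sjoin(points, county_gdf, how="left", predicate="within")
    # GEOID is the 5-digit state+county FIPS; keep nullable strings for leading zeros.
    joined["county_id_fips"] = joined["GEOID"].astype(pd.StringDtype())
    joined["state_id_fips"] = joined["county_id_fips"].str[:2]
    return pd.DataFrame(
        joined.drop(columns=["geometry", "GEOID", "index_right"], errors="ignore")
    )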

Proposed solutions

  1. Fork addfips and implement our desired changes, then do our own release so we can import this as a package. Pros: We don't have to write much code, and we can keep up to date with any changes made in addfips. Cons: Extremely annoying and we're likely to forget about this. We're still relying on a largely unmaintained package.

  2. Recreate the logic of addfips in pudl.helpers using Census data directly (e.g., this 2023 data, or by calling the Census's geometry API; see [here](https://www.census.gov/data/developers/data-sets/geo-info.html) for more info). A rough sketch of what this could look like follows this list. Pros: We are our own maintainers. Cons: We have to recreate the fuzzy-matching logic in the package we're currently using to avoid a loss of matching performance.

  3. Find some other new Python package to meet our needs.
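
For option 2, here's a minimal sketch of what a Census-table-backed replacement could look like, assuming we've already loaded a county reference table into a DataFrame with state_name, county_name, state_id_fips, and county_id_fips columns (those names are placeholders, not an existing PUDL asset):

import pandas as pd


def add_fips_ids_from_census(
    df: pd.DataFrame,
    census_counties: pd.DataFrame,
    state_col: str = "state",
    county_col: str = "county",
) -> pd.DataFrame:
    """Merge FIPS codes onto df from a Census-derived county reference table."""
    # Normalize the join keys on both sides; real code would also want the
    # diacritic/abbreviation handling discussed in the comments below.
    ref = census_counties.assign(
        state_key=lambda x: x.state_name.str.strip().str.lower(),
        county_key=lambda x: x.county_name.str.strip().str.lower(),
    )[["state_key", "county_key", "state_id_fips", "county_id_fips"]]
    out = df.assign(
        state_key=df[state_col].astype(pd.StringDtype()).str.strip().str.lower(),
        county_key=df[county_col].astype(pd.StringDtype()).str.strip().str.lower(),
    )
    out = out.merge(ref, how="left", on=["state_key", "county_key"]).drop(
        columns=["state_key", "county_key"]
    )
    # Keep nullable string dtypes so leading zeros survive.
    return out.astype(
        {"state_id_fips": pd.StringDtype(), "county_id_fips": pd.StringDtype()}
    )

One nice side effect of a merge-based approach is that it's vectorized, unlike the row-wise .apply() in the current helper.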

TrentonBush commented 6 days ago

The only fuzzy matching addfips does is to replace 12 diacritics (like "é") and standardize 3 abbreviations like "Saint"/"St.". The code is like 10 lines. Your input data has to have perfect quality modulo those two replacements or the matches will fail.
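
For what it's worth, that normalization amounts to something like this (an approximation of the behavior, not addfips's actual code; the abbreviation list here is illustrative):

import unicodedata

# Token-level replacements, e.g. "St."/"St" -> "Saint"; addfips handles a handful of these.
ABBREVIATIONS = {"st": "saint", "st.": "saint", "ste": "sainte", "ste.": "sainte"}


def normalize_name(name: str) -> str:
    """Strip diacritics and expand abbreviations before an exact-match lookup."""
    # NFKD-decompose accented characters, then drop the combining marks:
    # "Doña Ana" -> "Dona Ana".
    no_diacritics = "".join(
        c for c in unicodedata.normalize("NFKD", name) if not unicodedata.combining(c)
    )
    tokens = no_diacritics.lower().split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


assert normalize_name("Doña Ana") == "dona ana"
assert normalize_name("St. Marys") == "saint marys"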

The cardinal sin of addfips is to not maintain the source census data or handle the slight changes over time, which is the only thing that needs maintenance! But I'll stop there because I think I've written more lines complaining about addfips than there are lines of code in the entire package 😂