catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Use multi-year reporting in single report year to check values #2164

Open aesharpe opened 1 year ago

aesharpe commented 1 year ago

There are some FERC tables, like the f1_elc_op_mnt_expn table, that report two years of data in one table. There is a "current year" value, and a "previous year" value.

The current transformations remove the data from non-report year years. However, that data may be interesting to compare against previous years. Is the "previous year" data the same as the "current year" data from last year?

We should build a check that not only retains this data but uses it to check other years.

aesharpe commented 1 year ago

This isn't necessarily the best way to go about this, but here are some of my thoughts pertaining to the f1_elc_op_mnt_expn table.

We built the unstack_balance_from_report_year function to handle tables with starting balance and ending balance values reported in different rows. Here we're dealing with the same concept except with "current year" and "previous year". If we modify the unstack_balance_from_report_year to work with both instant and duration tables, it could be a good starting place for A) retaining both the current and previous year values and B) using the previous year values to compare against last years' data.

Here is a suggested modification of unstack_balance_from_report_year:

class UnstackValuesToReportYear(TransformParams):
    """Parameters for :func:`unstack_values_to_report_year`."""

    date_col: str
    """The name of the column used to extract year from and compare to report year."""

    unstack_label_older_year: str
    """The name of the label given to values from the year before the report year."""

    unstack_label_newer_year: str
    """The name of the label given to values from the report year."""

class UnstackValuesToReportYearXBRL(TransformParams):
    """Parameters for unstacking values for eiter instant or duration table."""

    xbrl_instant: UnstackValuesToReportYear()
    xbrl_duration: UnstackValuesToReportYear()

def unstack_values_to_report_year(
    df: pd.DataFrame, params: UnstackValuesToReportYear
) -> pd.DataFrame:
    """Turn start year end year rows into columns for each value type.

    This function is utilized in :func:`process_instant_xbrl`.

    There are some instant tables that report the start-of-year data and the
    end-of-year data on seperate rows. The dbf version of the table has a column for the
    starting data, a column for the ending data, and a row number that cooresponds with
    the row literal that data represents: i.e., cost, etc.

    This function unstacks that table and adds the suffixes of your choice to each of the columns. These may then be used as
    ``value_types`` in the :func:`wide_to_tody` function to normalize the table.

    There are two checks in place:

    First, it will make sure that there are not multiple entries per year for the same
    entiity_id. Ex: a row for 2020-12-31 and 2020-06-30 for entitiy_id X means that
    there's more data here than is getting reported. We could just drop these mid-year
    values, but we might want to keep them or at least check that there is no funny
    business with the data.

    It will also check that there are no mid-year dates period. If an entity reports
    a value from the middle of the year, there is no telling whether that actually
    represents the start or end balance and requires a closer look.
    """
    df["year"] = pd.to_datetime(df[params.date_col]).dt.year
    if df.duplicated(["entity_id", "year"]).any():
        raise AssertionError(
            "Looks like there are multiple entries per year--not sure which to use "
            "for the start/end balance."
        )
    if not pd.to_datetime(df[params.date_col]).dt.is_year_end.all():
        raise AssertionError(
            "Looks like there are some values in here that aren't from the end of "
            "the year. We can't use those to calculate start and end balances."
        )
    df.loc[
        df.report_year == (df.year + 1), "value_type"
    ] = params.unstack_label_older_year
    df.loc[df.report_year == df.year, "value_type"] = params.unstack_label_newer_year
    if not df.value_type.notna().all():
        # Remove rows from years that are not representative of start/end dates
        # for a given report year (i.e., the report year and one year prior).
        logger.warning(
            f"Dropping unexpected years: "
            f"{df.loc[df.value_type.isna(), 'year'].unique()}"
        )
        df = df[df["value_type"].notna()].copy()
    df = (
        df.drop(["year", params.date_col], axis="columns")
        .set_index(["entity_id", "report_year", "value_type"])
        .unstack("value_type")
    )
    # This turns a multi-index into a single-level index with tuples of strings
    # as the keys, and then converts the tuples of strings into a single string
    # by joining their values with an underscore. This results in column labels
    # like boiler_plant_equipment_steam_production_starting_balance
    df.columns = ["_".join(items) for items in df.columns.to_flat_index()]
    df = df.reset_index()
    return df
aesharpe commented 1 year ago

I also used the following code to check the previous year values against last year's previous values without altering the structure of the table:

processed_dbf = processed_dbf.sort_values(
     ["utility_id_ferc1", "start_date", "report_year"]
)

group = processed_dbf.groupby(
     ["utility_id_ferc1", "oandm_expense_type"]
)

processed_dbf["prev_match_current_year_values"] = group.apply(
     lambda x: (
          x.prev_yr_amt.shift(-1) == x.annual_oandm_expense
     )
)

processed_dbf["last_year"] = group.apply(
     lambda x: (x.report_year.max()==x.report_year)
)

We don't expect any of the most recent ("last year") values to have any values to compare against (because there is no previous year that equals the most recent year of data!) , so we'll want to filter those out when comparing previous/current year values.

jrea-rmi commented 1 year ago

using previous year values to supplement and correct current year values would certainly be nice. The instances when the previous year was not a reporting year, it adds an extra year of data, which is mostly valuable for new respondents (over getting some 1993 data). Correcting data for sign errors would also be nice. And I'd rather have the extended/corrected data in a single column for the output tables in general, knowing that we can reference the raw data tables to check specific errors we're seeing.

previous/current year seems the same as starting/end of year balances to me.