catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Clarify logging outputs for table transforms #2165

Open aesharpe opened 1 year ago

aesharpe commented 1 year ago

The transform process uses classes, such as AbstractTableTransformer and Ferc1AbstractTableTransformer, coupled with a standard set of method calls to modify tables. Whether the methods alter the table in question depends on some default parameter settings and the parameters specified in params.py.

Right now, the generic transformation functions defined in classes.py look like this:

def func_name(self, df, params=None):
    if params is None:
        params = self.params.drop_invalid_rows
    logger.info(f"{self.table_id.value}: Dropping remaining invalid rows.")
    ...

Whereas the ones in the transform/ferc1.py module look like this:

def func_name(self, df, params=None):
    if params is None:
        params = self.params.drop_duplicate_rows_dbf
    if params.table_name:
        logger.info(
            f"{self.table_id.value}: Dropping rows where primary key and data "
            "columns are duplicated."
        )
    ...

The extra conditional before the logging output ensures that the log message is only emitted if there are indeed valid parameters for that function. This is good because it avoids confusion as to whether the transformation was applied to a given table or not.

Other things to consider:

zaneselvans commented 1 year ago

Logging that is done by the function can't know what class it's inside of / table it's operating on, since there is no self to refer to.

Logging that is done in the class method that wraps the function won't have the context required to report transform-specific changes / progress / status.
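For reference, here's a minimal sketch of those two layers; the class name and signatures are illustrative, not the actual classes.py code:

def drop_invalid_rows(df: pd.DataFrame, params) -> pd.DataFrame:
    # Module-level function: there is no self here, so the log line can't say
    # which table is being transformed.
    logger.info("Dropping remaining invalid rows.")
    ...
    return df

class SomeTableTransformer(AbstractTableTransformer):
    def drop_invalid_rows(self, df: pd.DataFrame, params=None) -> pd.DataFrame:
        # Class method wrapper: knows the table via self.table_id, but not the
        # transform-specific details of what the wrapped function did to the data.
        if params is None:
            params = self.params.drop_invalid_rows
        logger.info(f"{self.table_id.value}: Dropping remaining invalid rows.")
        return drop_invalid_rows(df, params)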

It seems like we should have:

I'm suspicious that there is not a general way to do the second conditional you're suggesting though. Once params has been set:

params = self.params.drop_duplicate_rows_dbf

How do you determine whether a substantive transformation is going to take place or not, based on the value of the params? It seems like that will be different for every parameter class.
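As a sketch of why a single generic check is hard: what counts as "no substantive work" differs for every parameter class, so any check ends up looking something like the hypothetical helper below (the table_name field comes from the thread above; the columns field on RenameColumns is an assumption, and no such helper exists in PUDL):

def will_do_work(params: TransformParams) -> bool:
    # Each TransformParams class has its own notion of "empty" / no-op defaults.
    if isinstance(params, DropDuplicateRowsDbf):
        return params.table_name is not None  # only runs when a source table is named
    if isinstance(params, RenameColumns):
        return bool(params.columns)  # only runs when there is something to rename
    raise NotImplementedError("every parameter class needs its own check")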

aesharpe commented 1 year ago

I'm suspicious that there is not a general way to do the second conditional you're suggesting though. Once params has been set

What would be the reasoning against doing something like:

def drop_duplicate_rows_dbf(
    self, df: pd.DataFrame, params: DropDuplicateRowsDbf | None = None
) -> pd.DataFrame:
    if params is None:
        params = self.params.drop_duplicate_rows_dbf
    if params:
        logger.info(
            f"{self.table_id.value}: Dropping rows where primary key and "
            "data columns are duplicated."
        )
        df = drop_duplicate_rows_dbf(df, params=params)
    return df

Right now it says if params.table_name: but I think what's important is whether or not there are any params to begin with, because that's what determines whether the function runs, especially when coupled with #2160 about flagging invalid params.

How do you determine whether a substantive transformation is going to take place or not, based on the value of the params? It seems like that will be different for every parameter class.

I think the way to approach this is to include function-specific logging messages that don't include self.table_id.value. For drop duplicate rows, that just means making sure it tells you how many rows it's dropping (even if it's 0). I think it might already do that, actually (some of the functions already have these implemented). I mostly just want to know whether the function actually ran at all, versus: we asked it to run, found out there weren't any params, and then didn't run it at all.
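A sketch of what that function-level row count logging could look like (the name and signature here are illustrative, not the actual drop_duplicate_rows_dbf):

def drop_duplicate_rows(df: pd.DataFrame, subset: list[str]) -> pd.DataFrame:
    # Always report how many rows were dropped, even if it's zero, so the log
    # shows that the function actually ran.
    n_before = len(df)
    deduped = df.drop_duplicates(subset=subset)
    logger.info(f"Dropped {n_before - len(deduped)} of {n_before} rows as duplicates.")
    return deduped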

zaneselvans commented 1 year ago

I think the problem here is that params will always be set by the time you get to the second if statement. Either it was set at the beginning, or it has been set by the assignment within the first if statement.

self.params is a Ferc1TableTransformParams object, which is composed of many other TransformParams objects, some of which are specific to FERC Form 1, and some of which are more general and inherited from the parent TableTransformParams class. Here's the definition:

class Ferc1TableTransformParams(TableTransformParams):
    """A model defining what TransformParams are allowed for FERC Form 1.

    This adds additional parameter models beyond the ones inherited from the
    :class:`pudl.transform.classes.AbstractTableTransformer` class.
    """

    class Config:
        """Only allow the known table transform params."""

        extra = "forbid"

    rename_columns_ferc1: RenameColumnsFerc1 = RenameColumnsFerc1(
        dbf=RenameColumns(),
        xbrl=RenameColumns(),
        xbrl_instant=RenameColumns(),
        xbrl_duration=RenameColumns(),
    )
    wide_to_tidy: WideToTidySourceFerc1 = WideToTidySourceFerc1(
        dbf=WideToTidy(), xbrl=WideToTidy()
    )
    merge_xbrl_metadata: MergeXbrlMetadata = MergeXbrlMetadata()
    align_row_numbers_dbf: AlignRowNumbersDbf = AlignRowNumbersDbf()
    drop_duplicate_rows_dbf: DropDuplicateRowsDbf = DropDuplicateRowsDbf()
    unstack_balances_to_report_year_instant_xbrl: UnstackBalancesToReportYearInstantXbrl = (
        UnstackBalancesToReportYearInstantXbrl()
    )

All of the attributes have default values of the appropriate type, including drop_duplicate_rows_dbf, so even if the parameter dictionary in pudl.transform.params.ferc1 for the table being transformed does not provide values for drop_duplicate_rows_dbf, that attribute will still be set to the default value when the Ferc1TableTransformParams class is instantiated. E.g.

f1ttfp = pudl.transform.ferc1.Ferc1TableTransformParams()
if f1ttfp.drop_duplicate_rows_dbf:
    print("See, it's got a value")