catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
468 stars 107 forks source link

increase coverage of installation year (operating_date) in EIA plant parts list #1282

Closed cmgosnell closed 2 years ago

cmgosnell commented 2 years ago

Currently in the EIA plant-parts list/master unit list, there is a semi-low percentage of records with install years despite 97% of the generators having install years. Track down why this discrepancy and correct it to increase coverage.

The method which generates the installation_year column is... drumroll please... PlantPart.add_install_year()

idk why this is broken but I actually think it would be better to have a new version of an AddAttribute class which grabs the first or last of a sorted column. We have a version of this with AddPriorityAttribute but this is for categorical column. In order to mirror FERC's installation_year and the construction_year, I think we want the max and mix operating_date.dt.year

katie-lamb commented 2 years ago

I think this comment is best directed to @cmgosnell: After some poking I believe the reason why there is relatively low coverage for installation year lies in how unique installation years are chosen versus how the merge onto the plant part dataframe is done.

install = (
            gens_mega.assign(installation_year=lambda x: x.operating_date.dt.year)
            .astype({"installation_year": "Int64"})[
                self.id_cols + IDX_TO_ADD + ["installation_year"]
            ]
            .sort_values("installation_year", ascending=False)
            .drop_duplicates(subset=self.id_cols, keep="first")
            .dropna(subset=self.id_cols)
        )

To get the dataframe of installation years, operating date is sorted from most recent to oldest and then the most recent year in a group of uniquely identified generators is chosen (this is the drop_duplicates call). Here, the id keys are just id_cols for the PlantPart instance which are given by the PLANT_PARTS global.

part_df = part_df.merge(
            install, how="left", on=self.id_cols + IDX_TO_ADD, validate="m:1"
        )

Later, when these installation years are merged back onto the plant part dataframe, the merge is done on self.id_cols as well as IDX_TO_ADD (which is report_date and operational_status_pudl). Since the installation years were segmented by just id_cols, the inclusion ofreport_date and operational_status_pudl in the merge filters out most entries in the plant parts list from a matching installation year.

My initial take on what should happen:

"Year from the operating_date (when a generator went into service). This column attempts to emulate the reporting of the installation_year in FERC1 (which is either the year the oldest still operating unit went into service, or the year the most recently installed unit went into service). BOTH of those FERC columns could potentially be emulated using the information available on the EIA side."