Closed cmgosnell closed 2 years ago
I think this comment is best directed to @cmgosnell: After some poking I believe the reason why there is relatively low coverage for installation year lies in how unique installation years are chosen versus how the merge onto the plant part dataframe is done.
install = (
gens_mega.assign(installation_year=lambda x: x.operating_date.dt.year)
.astype({"installation_year": "Int64"})[
self.id_cols + IDX_TO_ADD + ["installation_year"]
]
.sort_values("installation_year", ascending=False)
.drop_duplicates(subset=self.id_cols, keep="first")
.dropna(subset=self.id_cols)
)
To get the dataframe of installation years, operating date is sorted from most recent to oldest and then the most recent year in a group of uniquely identified generators is chosen (this is the drop_duplicates
call). Here, the id keys are just id_cols
for the PlantPart
instance which are given by the PLANT_PARTS
global.
part_df = part_df.merge(
install, how="left", on=self.id_cols + IDX_TO_ADD, validate="m:1"
)
Later, when these installation years are merged back onto the plant part dataframe, the merge is done on self.id_cols
as well as IDX_TO_ADD
(which is report_date
and operational_status_pudl
). Since the installation years were segmented by just id_cols
, the inclusion ofreport_date
and operational_status_pudl
in the merge filters out most entries in the plant parts list from a matching installation year.
My initial take on what should happen:
report_date
. Shouldn't the installation year be the same for any grouping of plant parts no matter the report_date
?operational_status_pudl
should happen when creating the installation year dataframe. Maybe? I found the following note about installation_year
a little confusing:"Year from the operating_date (when a generator went into service). This column attempts to emulate the reporting of the installation_year in FERC1 (which is either the year the oldest still operating unit went into service, or the year the most recently installed unit went into service). BOTH of those FERC columns could potentially be emulated using the information available on the EIA side."
Currently in the EIA plant-parts list/master unit list, there is a semi-low percentage of records with install years despite 97% of the generators having install years. Track down why this discrepancy and correct it to increase coverage.
The method which generates the
installation_year
column is... drumroll please...PlantPart.add_install_year()
idk why this is broken but I actually think it would be better to have a new version of an
AddAttribute
class which grabs the first or last of a sorted column. We have a version of this withAddPriorityAttribute
but this is for categorical column. In order to mirror FERC'sinstallation_year
and theconstruction_year
, I think we want the max and mixoperating_date.dt.year