Fix expansion of `unit_id_pudl` to be compatible with annually varying `prime_mover_code`

cmgosnell commented 1 year ago

Describe the bug

A while back @zaneselvans developed another step in our pursuit of complete coverage of unit_id_pudl #1037. We didn't actually turn it on in our tests so we didn't notice it is not compatible with the change we made to how we are normalizing the prime_mover_code - we determined that the PM code should actually be annually varying so it now gets harvested into the annually varying generators_eia860 table instead of the static generators_entity_eia table.

This change is incompatible with the way the current pudl.output.eia860.fill_unit_ids works during the assign_single_gen_unit_ids step. This could very well effects other stages of the unit id assignment as well!

To Reproduce

Steps to reproduce the behavior -- ideally including a code snippet that causes the error to appear.

Is the bug related to the software / database? If so, please attach the settings.yml file you're using to specify which data to load, and make a note of where in the ETL process the error is happening.
Have you found an error or inconsistency in the data that PUDL brings together? If so, what is the data source, year, plant_id, etc -- how can we find the data you're looking at, and what is the nature of the error or inconsistency.
Does the problem happen when you're starting up the database (importing data) or does it happen later when you're trying to use the database?

With an all-year pudl.sqlite:

pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine,freq="AS",unit_ids=True)
gens = pudl_out.gens_eia860()

Error:

2023-04-19 11:51:16 [    INFO] catalystcoop.pudl.transform.eia861:582 Started with 81265 missing BA Codes out of 185553 records (43.80%)
2023-04-19 11:51:18 [    INFO] catalystcoop.pudl.transform.eia861:606 Ended with 14923 missing BA Codes out of 185553 records (8.04%)
2023-04-19 11:51:18 [    INFO] catalystcoop.pudl.output.eia860:182 91.6% of plant records have consistently reported BA Codes
2023-04-19 11:51:18 [    INFO] catalystcoop.pudl.output.eia860:232 Before any filling treatment has been applied. 43.8% of records have no BA codes
2023-04-19 11:51:19 [    INFO] catalystcoop.pudl.output.eia860:232 Backfilling and consistent value is the same. Filled w/ most consistent BA code. 10.8% of records have no BA codes
2023-04-19 11:51:19 [    INFO] catalystcoop.pudl.output.eia860:232 SWPP is most consistent value. Filled w/ oldest BA code. 8.3% of records have no BA codes
2023-04-19 11:51:19 [    INFO] catalystcoop.pudl.output.eia860:232 NWMT is most consistent value. Filled w/ oldest BA code. 8.3% of records have no BA codes
2023-04-19 11:51:19 [    INFO] catalystcoop.pudl.output.eia860:232 Two or more years of oldest BA code. Filled w/ oldest BA code. 8.1% of records have no BA codes
2023-04-19 11:51:19 [ WARNING] catalystcoop.pudl.output.eia860:31 pudl.output.eia860.utilities_eia860() will be deprecated in a future version of PUDL. In the future, call the PudlTabl.utils_eia860() method or pull the denorm_utilities_eia tabledirectly from the pudl.sqlite database.
2023-04-19 11:51:27 [    INFO] catalystcoop.pudl.output.eia860:510 Assigning pudl unit ids
2023-04-19 11:51:28 [    INFO] catalystcoop.pudl.output.eia860:1171 Selected 220100 ['CC', 'CS', 'GT', 'IC'] records lacking Unit IDs from 523563 records overall. 

---------------------------------------------------------------------------
MergeError                                Traceback (most recent call last)
Cell In[159], line 2
      1 pudl_out = pudl.output.pudltabl.PudlTabl(pudl_engine,freq="AS",unit_ids=True)
----> 2 gens = pudl_out.gens_eia860()

File ~/code/pudl/src/pudl/output/pudltabl.py:293, in PudlTabl.gens_eia860(self, update)
    280 """Pull a dataframe describing generators, as reported in EIA 860.
    281 
    282 If you want to fill the technology_description field, recreate
   (...)
    290     pandas.DataFrame: a denormalized table for interactive use.
    291 """
    292 if update or self._dfs["gens_eia860"] is None:
--> 293     self._dfs["gens_eia860"] = pudl.output.eia860.generators_eia860(
    294         self.pudl_engine,
    295         start_date=self.start_date,
    296         end_date=self.end_date,
    297         unit_ids=self.unit_ids,
    298         fill_tech_desc=self.fill_tech_desc,
    299     )
    301 return self._dfs["gens_eia860"]

File ~/code/pudl/src/pudl/output/eia860.py:511, in generators_eia860(pudl_engine, start_date, end_date, unit_ids, fill_tech_desc)
    509 if unit_ids:
    510     logger.info("Assigning pudl unit ids")
--> 511     out_df = assign_unit_ids(out_df)
    513 if fill_tech_desc:
    514     logger.info("Filling technology type")

File ~/code/pudl/src/pudl/output/eia860.py:900, in assign_unit_ids(gens_df)
    888     errstr = f"Input DataFrame missing required columns: {missing_cols}."
    889     raise ValueError(errstr)
    891 unit_ids = (
    892     gens_df.loc[:, required_cols]
    893     # Forward and back fill preexisting Unit IDs:
    894     .pipe(fill_unit_ids)
    895     # Assign Unit IDs to the CT+CA CC generators:
    896     .pipe(assign_cc_unit_ids)
    897     # For whole-combined cycle (CC) and single-shaft combined cycle (CS)
    898     # units, we give each generator their own unit ID. We do the same for
    899     # internal combustion and simple-cycle gas combustion turbines.
--> 900     .pipe(assign_single_gen_unit_ids, prime_mover_codes=["CC", "CS", "GT", "IC"])
    901     # Nuclear units don't report in boiler_fuel_eia923 or generation_eia923
    902     # Their fuel consumption is reported as mmbtu in generation_fuel_eia923
    903     # Their net generation also only shows up in generation_fuel_eia923
    904     # The generation_fuel_eia923 table records a "nuclear_unit_id" which
    905     # appears to be the same as the associated generator_id. However, we
    906     # can't use that as a unit_id_pudl since it might have a collision with
    907     # other already assigned unit_id_pudl values in the same plant for
    908     # generators with other fuel types. Thus we still need to assign them
    909     # a fuel-and-prime-mover based unit ID here. For now ALL nuclear plants
    910     # use steam turbines.
    911     .pipe(
    912         assign_single_gen_unit_ids,
    913         prime_mover_codes=["ST"],
    914         fuel_type_code_pudl="nuclear",
    915         label_prefix="nuclear",
    916     )
    917     # In these next 4 assignments, we lump together all steam turbine (ST)
    918     # generators that have a consistent simplified fuel_type_code_pudl
    919     # across all years within a given plant into the same unit, since we
    920     # won't be able to distinguish them in the generation_fuel_eia923
    921     # table. This will lump together solid fuels like BIT, LIG, SUB, PC etc.
    922     # under "coal".  There are a few cases in which a generator has truly
    923     # changed its fuel type, e.g. coal-to-gas conversions but these are
    924     # rare and insubstantial. They will not be assigned a Unit ID in this
    925     # process. Non-fuel steam generation is also left out (geothermal &
    926     # solar thermal)
    927     .pipe(
    928         assign_prime_fuel_unit_ids,
    929         prime_mover_code="ST",
    930         fuel_type_code_pudl="coal",
    931     )
    932     .pipe(
    933         assign_prime_fuel_unit_ids, prime_mover_code="ST", fuel_type_code_pudl="oil"
    934     )
    935     .pipe(
    936         assign_prime_fuel_unit_ids, prime_mover_code="ST", fuel_type_code_pudl="gas"
    937     )
    938     .pipe(
    939         assign_prime_fuel_unit_ids,
    940         prime_mover_code="ST",
    941         fuel_type_code_pudl="waste",
    942     )
    943     # Retain only the merge keys and output columns
    944     .loc[
    945         :,
    946         [
    947             "plant_id_eia",  # Merge key
    948             "generator_id",  # Merge key
    949             "report_date",  # Merge key
    950             "unit_id_pudl",  # Output column
    951             "bga_source",  # Output column
    952         ],
    953     ]
    954 )
    955 # Check that each generator is only ever associated with a single unit,
    956 # at least within the codes that we've just assigned -- the Unit IDs that
    957 # are based on the EIA boiler-generator-association or other matching
    958 # methods could legitimately specify different units for generators over
    959 # time -- which could impact the forward-/back-filled IDs as well:
    960 old_codes = list(gens_df.bga_source.unique()) + ["bfill_units", "ffill_units"]

File ~/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/generic.py:5839, in NDFrame.pipe(self, func, *args, **kwargs)
   5781 @final
   5782 @doc(klass=_shared_doc_kwargs["klass"])
   5783 def pipe(
   (...)
   5787     **kwargs,
   5788 ) -> T:
   5789     r"""
   5790     Apply chainable functions that expect Series or DataFrames.
   5791 
   (...)
   5837     ...  )  # doctest: +SKIP
   5838     """
-> 5839     return com.pipe(self, func, *args, **kwargs)

File ~/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/common.py:513, in pipe(obj, func, *args, **kwargs)
    511     return func(*args, **kwargs)
    512 else:
--> 513     return func(obj, *args, **kwargs)

File ~/code/pudl/src/pudl/output/eia860.py:1200, in assign_single_gen_unit_ids(gens_df, prime_mover_codes, fuel_type_code_pudl, label_prefix)
   1178 unit_ids = (
   1179     gens_df.loc[row_mask, cols]
   1180     .drop_duplicates()
   (...)
   1196     .drop(["max_unit_id_pudl", "prime_mover_code"], axis="columns")
   1197 )
   1198 # Split original dataframe based on row_mask, and merge in the new IDs and
   1199 # labels only on the subset of the dataframe matching our row_mask:
-> 1200 return _append_masked_units(
   1201     gens_df, row_mask, unit_ids, on=["plant_id_eia", "generator_id"]
   1202 )

File ~/code/pudl/src/pudl/output/eia860.py:1112, in _append_masked_units(gens_df, row_mask, unit_ids, on)
   1087 def _append_masked_units(gens_df, row_mask, unit_ids, on):
   1088     """Replace rows with new PUDL Unit IDs in the original dataframe.
   1089 
   1090     Merges the newly assigned Unit IDs found in ``unit_ids`` into the
   (...)
   1107         pandas.DataFrame:
   1108     """
   1109     return gens_df.loc[~row_mask].append(
   1110         gens_df.loc[row_mask]
   1111         .drop(["unit_id_pudl", "bga_source"], axis="columns")
-> 1112         .merge(
   1113             unit_ids,
   1114             on=on,
   1115             how="left",
   1116             validate="many_to_one",
   1117         )
   1118     )

File ~/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/frame.py:10093, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
  10074 @Substitution("")
  10075 @Appender(_merge_doc, indents=2)
  10076 def merge(
   (...)
  10089     validate: str | None = None,
  10090 ) -> DataFrame:
  10091     from pandas.core.reshape.merge import merge
> 10093     return merge(
  10094         self,
  10095         right,
  10096         how=how,
  10097         on=on,
  10098         left_on=left_on,
  10099         right_on=right_on,
  10100         left_index=left_index,
  10101         right_index=right_index,
  10102         sort=sort,
  10103         suffixes=suffixes,
  10104         copy=copy,
  10105         indicator=indicator,
  10106         validate=validate,
  10107     )

File ~/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/reshape/merge.py:110, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     93 @Substitution("\nleft : DataFrame or named Series")
     94 @Appender(_merge_doc, indents=0)
     95 def merge(
   (...)
    108     validate: str | None = None,
    109 ) -> DataFrame:
--> 110     op = _MergeOperation(
    111         left,
    112         right,
    113         how=how,
    114         on=on,
    115         left_on=left_on,
    116         right_on=right_on,
    117         left_index=left_index,
    118         right_index=right_index,
    119         sort=sort,
    120         suffixes=suffixes,
    121         indicator=indicator,
    122         validate=validate,
    123     )
    124     return op.get_result(copy=copy)

File ~/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/reshape/merge.py:713, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
    709 # If argument passed to validate,
    710 # check if columns specified as unique
    711 # are in fact unique.
    712 if validate is not None:
--> 713     self._validate(validate)

File ~/mambaforge/envs/pudl-dev/lib/python3.11/site-packages/pandas/core/reshape/merge.py:1531, in _MergeOperation._validate(self, validate)
   1529 elif validate in ["many_to_one", "m:1"]:
   1530     if not right_unique:
-> 1531         raise MergeError(
   1532             "Merge keys are not unique in right dataset; "
   1533             "not a many-to-one merge"
   1534         )
   1536 elif validate in ["many_to_many", "m:m"]:
   1537     pass

MergeError: Merge keys are not unique in right dataset; not a many-to-one merge

I did some debugging sleuthing and found that the unit_ids that were trying to be merged into gens_df in _append_masked_units had multiple proposed units for a small number of generators due to having multiple PMs over time:

Expected behavior

Preferably, we could run this function to make lotsa unit_id_pudl's 😎 This will be helpful in a number of ways but in particular for making the subplant_id to glue epacamd and eia (see #2491)

aesharpe commented 12 months ago

I did some debugging sleuthing and found that the unit_ids that were trying to be merged into gens_df in _append_masked_units had multiple proposed units for a small number of generators due to having multiple PMs over time

@cmgosnell is the problem here that these should be merged into the non-static gens table vs. the entity generators table? Or is the problem that unit ids should not vary by generator at all?

cmgosnell commented 11 months ago

When the unit id code was written, we expected PM's to be static. We have since transitioned PM's to be an annually varying generator attribute. Which just breaks the expectations built into this whole step.

I honestly don't know what the answer is here. It is possible that we have to enable the unit id to be varied annually. Or we could ignore these PM-time-varying generators in this process bc >95% of them don't vary. Or we need to re-jigger the unit id code to enable time-varying PM codes while keeping unit ids static.

If this were me doing this, I'd take a minute to go do a survey of a handful of time-varying PM plants, convert those plants into little unit tests & start building an understanding of which of the various endpoints we should be shooting for.

catalyst-cooperative / pudl