catalyst-cooperative/pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

`pudl_out` not working to load EIA tables for single years prior to 2013 #2269

Closed: grgmiller closed this issue 1 year ago

grgmiller commented 1 year ago

In our OGE pipeline, we are trying to load EIA data for years prior to 2013 by running pudl_out.gens_eia860() and pudl_out.gf_eia923(). However, when we run this, we get the KeyError shown in the traceback below.
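Roughly how we're hitting it; the engine setup and path here are simplified placeholders rather than our exact OGE code:

import sqlalchemy as sa
from pudl.output.pudltabl import PudlTabl

# Placeholder path to a local PUDL SQLite database:
pudl_engine = sa.create_engine("sqlite:///pudl.sqlite")

# Requesting a single year prior to 2013:
pudl_out = PudlTabl(
    pudl_engine,
    start_date="2012-01-01",
    end_date="2012-12-31",
)

gens = pudl_out.gens_eia860()  # raises the KeyError in the traceback below
gf = pudl_out.gf_eia923()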

It looks like this issue originates in fill_in_missing_ba_codes, whose docstring notes:

 Balancing authority codes did not begin being reported until 2013. This function
    fills in the old years with BA codes using two main methods:

    * Backfilling with the oldest reported BA code for each plant.
    * Backfilling with the most frequently reported BA code for each plant.
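If I'm reading that right, the idea is conceptually something like the following (just a rough pandas sketch of the backfill concept, not the actual PUDL implementation):

import pandas as pd

def fill_ba_codes_sketch(plants: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: fill each plant's missing BA codes.

    Uses the most frequently reported BA code per plant_id_eia; the real
    function also tries the oldest reported code and consistency checks.
    """
    modal_ba = (
        plants.dropna(subset=["balancing_authority_code_eia"])
        .groupby("plant_id_eia")["balancing_authority_code_eia"]
        .agg(lambda s: s.mode().iat[0])
    )
    missing = plants["balancing_authority_code_eia"].isna()
    plants = plants.copy()
    plants.loc[missing, "balancing_authority_code_eia"] = (
        plants.loc[missing, "plant_id_eia"].map(modal_ba)
    )
    return plants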

I suspect this is happening because we are loading data for a single year (2012) rather than running pudl_out with a range of years that includes years after 2012 to backfill from. I recognize that you usually run the pudl pipeline for multiple years rather than a single year, but is there any way to make this work for a single year prior to 2013, or is loading newer data as well the only solution?

If newer data is required, how many years would we need to load? Just 2013? All of 2013-2021? Since pudl_out takes start date and end date arguments, if we only wanted data for 2005, would we actually have to load everything from 2005 through 2013+ for this function to work, even though we are only interested in a single year?

Trying to load EIA-860 data for 2012:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 gens = pudl_out.gens_eia860()
      2 gens

File a:\miniconda3\envs\open_grid_emissions\lib\site-packages\pudl\output\pudltabl.py:602, in PudlTabl.gens_eia860(self, update)
    589 """Pull a dataframe describing generators, as reported in EIA 860.
    590 
    591 If you want to fill the technology_description field, recreate
   (...)
    599     pandas.DataFrame: a denormalized table for interactive use.
    600 """
    601 if update or self._dfs["gens_eia860"] is None:
--> 602     self._dfs["gens_eia860"] = pudl.output.eia860.generators_eia860(
    603         self.pudl_engine,
    604         start_date=self.start_date,
    605         end_date=self.end_date,
    606         unit_ids=self.unit_ids,
    607         fill_tech_desc=self.fill_tech_desc,
    608     )
    610 return self._dfs["gens_eia860"]

File a:\miniconda3\envs\open_grid_emissions\lib\site-packages\pudl\output\eia860.py:459, in generators_eia860(pudl_engine, start_date, end_date, unit_ids, fill_tech_desc)
    456 out_df.report_date = pd.to_datetime(out_df.report_date)
    458 # Bring in some generic plant & utility information:
--> 459 pu_eia = plants_utils_eia860(
    460     pudl_engine, start_date=start_date, end_date=end_date
    461 ).drop(["plant_name_eia", "utility_id_eia"], axis="columns")
    462 out_df = pd.merge(out_df, pu_eia, on=["report_date", "plant_id_eia"], how="left")
    464 # Merge in the unit_id_pudl assigned to each generator in the BGA process
    465 # Pull the BGA table and make it unit-generator only:

File a:\miniconda3\envs\open_grid_emissions\lib\site-packages\pudl\output\eia860.py:338, in plants_utils_eia860(pudl_engine, start_date, end_date)
    311 """Create a dataframe of plant and utility IDs and names from EIA 860.
    312 
    313 Returns a pandas dataframe with the following columns:
   (...)
    334     names from EIA 860.
    335 """
    336 # Contains the one-to-one mapping of EIA plants to their operators
    337 plants_eia = (
--> 338     plants_eia860(pudl_engine, start_date=start_date, end_date=end_date)
    339     .drop(
    340         [
    341             "utility_id_pudl",
    342             "city",
    343             "state",  # Avoid dupes in merge
    344             "zip_code",
    345             "street_address",
    346             "utility_name_eia",
    347         ],
    348         axis="columns",
    349     )
    350     .dropna(subset=["utility_id_eia"])  # Drop unmergable records
    351 )
    352 utils_eia = utilities_eia860(pudl_engine, start_date=start_date, end_date=end_date)
    354 # to avoid duplicate columns on the merge...

File a:\miniconda3\envs\open_grid_emissions\lib\site-packages\pudl\output\eia860.py:136, in plants_eia860(pudl_engine, start_date, end_date)
    128 utils_eia_select = sa.sql.select(utils_eia_tbl)
    129 utils_eia_df = pd.read_sql(utils_eia_select, pudl_engine)
    131 out_df = (
    132     pd.merge(plants_eia_df, plants_eia860_df, how="left", on=["plant_id_eia"])
    133     .merge(plants_g_eia_df, how="left", on=["plant_id_eia"])
    134     .merge(utils_eia_df, how="left", on=["utility_id_eia"])
    135     .dropna(subset=["report_date", "plant_id_eia"])
--> 136     .pipe(fill_in_missing_ba_codes)
    137     .pipe(apply_pudl_dtypes, group="eia")
    138 )
    139 return out_df

File a:\miniconda3\envs\open_grid_emissions\lib\site-packages\pandas\core\generic.py:5839, in NDFrame.pipe(self, func, *args, **kwargs)
   5781 @final
   5782 @doc(klass=_shared_doc_kwargs["klass"])
   5783 def pipe(
   (...)
   5787     **kwargs,
   5788 ) -> T:
   5789     r"""
   5790     Apply chainable functions that expect Series or DataFrames.
   5791 
   (...)
   5837     ...  )  # doctest: +SKIP
   5838     """
-> 5839     return com.pipe(self, func, *args, **kwargs)

File a:\miniconda3\envs\open_grid_emissions\lib\site-packages\pandas\core\common.py:503, in pipe(obj, func, *args, **kwargs)
    501     return func(*args, **kwargs)
    502 else:
--> 503     return func(obj, *args, **kwargs)

File a:\miniconda3\envs\open_grid_emissions\lib\site-packages\pudl\output\eia860.py:232, in fill_in_missing_ba_codes(plants)
    227     logger.info(
    228         f"{method_str}. {currently_null_len/len(plants):.1%} of records have no BA codes"
...
-> 6173     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6175 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6176 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index(['plant_id_eia', 'balancing_authority_code_eia', 'balancing_authority_code_eia_consistent_rate'], dtype='object')] are in the [columns]"

zaneselvans commented 1 year ago

We rarely use only a subset of the years, and when we do it's typically just the most recent year (for integration testing). The data required to do the value filling is only present in the later years, so I would not expect this to work, but we should emit a warning or a useful error message indicating what's happening.
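Something as simple as a guard near the top of fill_in_missing_ba_codes might be enough. A hypothetical sketch, not what's in the code today:

# Hypothetical guard, not the current implementation:
if pd.to_datetime(plants.report_date).max() < pd.Timestamp("2013-01-01"):
    raise ValueError(
        "Balancing authority codes are only reported from 2013 onward, so "
        "they cannot be backfilled when the requested date range ends "
        "before 2013. Include at least one year of 2013+ data."
    )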

However, we (primarily @bendnorman, see #1973) are in the process of refactoring the ETL to load all of the output tables directly into the database, so users don't need all of the software that's currently layered on top of the base PUDL DB. This will include all of the backfilled / imputed / calculated values, and once that's done you'll be able to simply select the range of dates you want from that table of derived values.
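Once that lands, pulling a single pre-2013 year should be as simple as a date-filtered query against the new table. Hypothetically (the table name here is just a guess at what the denormalized generators table might be called):

import pandas as pd

# Table name is hypothetical; filter to the single year of interest:
gens_2012 = pd.read_sql(
    "SELECT * FROM denorm_generators_eia860 "
    "WHERE report_date BETWEEN '2012-01-01' AND '2012-12-31'",
    pudl_engine,
)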

In the meantime you'll need to include at least one year of data that contains the values used for backfilling. The more years of data you include, the more information the filling process has to work with. These processes also assume a contiguous block of years, so yes, if you wanted just 2005, you'd need to load at least 2005-2013.
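For example, something along these lines (dates and constructor details are illustrative): request the wider contiguous range, then trim back down to the year you actually want:

from pudl.output.pudltabl import PudlTabl

# Contiguous range reaching into the years where BA codes are reported:
pudl_out = PudlTabl(
    pudl_engine,
    start_date="2005-01-01",
    end_date="2013-12-31",
)

gens = pudl_out.gens_eia860()
gens_2005 = gens[gens.report_date.dt.year == 2005]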

grgmiller commented 1 year ago

Thanks - this is helpful. For now we'll try loading with multiple years. Any idea on the approximate timeline for this refactored ETL process to be complete?

zaneselvans commented 1 year ago

The goal is to switch the base DB ETL over to Dagster by the end of January (see #2104; @bendnorman will have the best idea of the expected delivery date), and then get all of the derived values / imputations / denormalized tables integrated by the end of Q1. We'll preserve the deprecated pudl_out interface for backward compatibility (having it just read the new tables from the DB) for a while before moving to distributing only the database, without the ginormous Python package that runs the ETL.

bendnorman commented 1 year ago

@zaneselvans' description of the expected timeline is accurate. We are planning to structure the refactor so output tables can be converted one at a time. We can prioritize converting the gf_eia923 table so you won't run into this bug.

grgmiller commented 1 year ago

Thanks! We primarily work with gens_eia860, plants_eia860, gf_eia923, bf_eia923, gen_original_eia923, and bga_eia860.

bendnorman commented 1 year ago

Sounds good! We'll try to tackle those tables first.

zaneselvans commented 1 year ago

Shall we close this since there's a broader fix in the works?