catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Provide feedback for RMI EPA-EIA crosswalk updates #2371

Open zschira opened 1 year ago

zschira commented 1 year ago

Background

RMI has been working on updates to the EPA crosswalk notebook here. I'm collecting some initial thoughts on what they should do to prep this for a PR back into the original EPA repo. This should probably be moved somewhere else, but putting it here for now.

Tasks before PR

Overall looks quite manageable to get this ready for a PR. Many changes are fairly minor, and don't need much work.

Cleanup

Possible problems

Documentation

Needs from RMI

Many of the above tasks can be handled by Catalyst, but it would be helpful to get some inputs from RMI.

arengel commented 1 year ago

Thanks @zschira for taking a look at our changes to the crosswalk and laying out these next steps!

I think these all make sense in terms of getting to a PR. The other part of this is whether a PR makes sense for some or all of the changes. I think a response to your first question would help us have that conversation with the maintainers of the upstream repo.

A high-level overview of the changes

  1. Update data sources. There are broadly two parts to this: (a) bump the year of the Annual 860 data to the most recent year (and add the ability to use Early Release data), and (b) use Monthly 860 as the source for generator data. I think (a) would be relatively uncontroversial, with the possible exception of using Early Release data. On (b), the logic is two-fold: it allows adding matches for new plants more quickly, and the monthly generator data is more complete for retired generators.
  2. Create a multi-year crosswalk. Here we bring in many years of 860 plant characteristic data and join it to an expanded crosswalk. The goal is to link CAMD units to capacity and fuel types by year, since those data can change from year to year. That then allows us to assign each CAMD unit a plant, a prime mover, and one or more fuels. It also supported the allocation of CAMD units to EIA generators.
  3. Usability tweaks. Things like caching CAMD data to avoid multiple downloads when making adjustments.
  4. Removal of optional data. I'm not sure why this was removed.
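
To make (2) concrete, here is a minimal Python sketch of the multi-year join. It is an illustration only, not the notebook's actual code (which is not shown in this thread): the column names, the sample rows, and the helpers `build_multi_year_crosswalk` and `fuels_by_unit_year` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: each CAMD unit matched to one EIA generator.
crosswalk = [
    {"camd_plant_id": 3, "camd_unit_id": "1", "eia_plant_id": 3, "eia_generator_id": "1"},
    {"camd_plant_id": 3, "camd_unit_id": "2", "eia_plant_id": 3, "eia_generator_id": "2"},
]

# Hypothetical yearly 860 generator characteristics (made-up values).
eia860_by_year = [
    {"report_year": 2019, "eia_plant_id": 3, "eia_generator_id": "1",
     "prime_mover": "ST", "fuel": "BIT", "capacity_mw": 150.0},
    {"report_year": 2020, "eia_plant_id": 3, "eia_generator_id": "1",
     "prime_mover": "ST", "fuel": "NG", "capacity_mw": 150.0},
    {"report_year": 2019, "eia_plant_id": 3, "eia_generator_id": "2",
     "prime_mover": "GT", "fuel": "NG", "capacity_mw": 80.0},
    {"report_year": 2020, "eia_plant_id": 3, "eia_generator_id": "2",
     "prime_mover": "GT", "fuel": "NG", "capacity_mw": 85.0},
]


def build_multi_year_crosswalk(crosswalk, eia860_by_year):
    """Repeat each crosswalk match once per year of 860 data for its generator."""
    by_generator = defaultdict(list)
    for row in eia860_by_year:
        by_generator[(row["eia_plant_id"], row["eia_generator_id"])].append(row)

    expanded = []
    for match in crosswalk:
        key = (match["eia_plant_id"], match["eia_generator_id"])
        years = sorted(by_generator.get(key, []), key=lambda r: r["report_year"])
        for gen_year in years:
            # Merge the static match with that year's characteristics.
            expanded.append({**match, **gen_year})
    return expanded


def fuels_by_unit_year(multi_year):
    """Collect the set of fuels linked to each CAMD unit in each year."""
    fuels = defaultdict(set)
    for row in multi_year:
        key = (row["camd_plant_id"], row["camd_unit_id"], row["report_year"])
        fuels[key].add(row["fuel"])
    return {key: sorted(values) for key, values in fuels.items()}


multi_year = build_multi_year_crosswalk(crosswalk, eia860_by_year)
unit_fuels = fuels_by_unit_year(multi_year)
```

In this toy data, unit 1 switches fuel between years, which is the kind of year-to-year change a single-year crosswalk would miss.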

Pushing upstream

I think (1) makes a lot of sense for a PR. There are some finer points around the use of Early Release data and the switch to monthly generator data, but I think we can work out something mutually agreeable on both.

I'm more ambivalent about (2). Some of the rationale for the multi-year crosswalk has diminished, as we no longer attempt to allocate CAMD units to EIA generators and we have better downstream methods for determining the prime mover and fuel codes to associate with CAMD units. This points to the fact that this functionality is separable from the matching logic of the crosswalk. Unless we believe that others would be interested in it, I'm not sure it's worth the added effort of putting it into a PR. Also, personally, I would rather have it in Python, so I would argue that any version of this we do want should live in PUDL or elsewhere. It is maybe better handled as part of the broader CEMS crosswalk work we've been discussing with Catalyst and OGE.

(3) is minor and fine to include. I would probably not include (4) in the PR. One other category of changes that we should revert is the changes to column names.
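
For reference, the caching in (3) amounts to something like the following Python sketch. This is an assumption about the general pattern, not the notebook's code: `cached_download` and the `fetch` callable are hypothetical names, and the actual download step is left to the caller.

```python
from pathlib import Path
from typing import Callable


def cached_download(cache_path: Path, fetch: Callable[[], bytes]) -> bytes:
    """Return the cached bytes if present; otherwise call fetch() and store the result."""
    if cache_path.exists():
        # Cache hit: skip the download entirely.
        return cache_path.read_bytes()
    data = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(data)
    return data
```

Passing the download step in as `fetch` keeps the cache logic independent of where the CAMD data comes from; deleting the cached file forces a fresh download.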

Other questions

  1. I think in cases where the commented-out code is functionality we are replacing with a different process, it should be deleted. In other cases it depends, but generally we would want to either delete or restore commented-out code.
  2. I think the process for manual matching does not need to change, especially if we don't try to push the multi-year changes upstream. Though we should probably update the manual matches as best we can for the newer EIA data.

aesharpe commented 1 year ago

Putting this on hold in favor of updating the crosswalk to use 2021 EIA data instead of 2018.