Provide feedback for RMI EPA-EIA crosswalk updates

zschira commented 1 year ago

Background

RMI has been working on updates to the EPA crosswalk notebook here. I'm collecting some initial thoughts on what they should do to prep this for a PR back into the original EPA repo. This should probably be moved somewhere else, but putting it here for now.

Tasks before PR

Overall, this looks quite manageable to get ready for a PR. Many of the changes are fairly minor and don't need much work.

Cleanup

[ ] Fix formatting. There are some small spacing and stylistic issues. The original repo follows a style guide, which can be used to fix these.
[ ] Switch variables with strings "Yes" and "No" to boolean flags
[ ] Improve variable names. There are some ambiguous variables like eia_860_year_num, eia_860_year_file_name, and eia_860_year, which are all very similar in name/value, which can be confusing.
[ ] Remove API key
[ ] Replace & with &&. & is vectorized and can cause confusion; && is more appropriate in most cases, such as the scalar conditions of if statements
[ ] There are some places with confusing nested if statements that would benefit from some comments
[ ] Commented-out code should be removed or explained
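
As a small illustration of the "Yes"/"No" item above, here is a minimal sketch in Python (the notebook itself appears to be in R, where the same idea applies with TRUE/FALSE). The helper `yes_no_to_bool` and the flag name `use_early_release` are hypothetical and not taken from the notebook.

```python
def yes_no_to_bool(value: str) -> bool:
    """Convert a "Yes"/"No" string setting to a boolean, rejecting anything else."""
    normalized = value.strip().lower()
    if normalized == "yes":
        return True
    if normalized == "no":
        return False
    raise ValueError(f"expected 'Yes' or 'No', got {value!r}")


# Before: a string setting that every use site has to compare against "Yes".
use_early_release_setting = "Yes"  # hypothetical setting name

# After: convert once at the top, then branch on a real boolean everywhere else.
use_early_release = yes_no_to_bool(use_early_release_setting)
```

Converting once and failing loudly on unexpected values avoids silent mismatches such as "yes" vs "Yes" further down the notebook.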

Possible problems

[ ] Address eia_data_file_hist. This variable is assigned inside an if statement but used outside of it, so it may be undefined when the condition is false.
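
One common way to handle this kind of issue, sketched in Python under assumptions: give the variable a value on every path before the branch, and guard the later use. Only the name `eia_data_file_hist` comes from the notebook; the functions, the `use_hist` flag, and the file name pattern are hypothetical.

```python
from typing import Optional


def pick_hist_file(use_hist: bool, year: int) -> Optional[str]:
    """Return the historical file name, or None when it is not needed."""
    # Assign a default first so the name exists on every code path.
    eia_data_file_hist: Optional[str] = None
    if use_hist:
        # Hypothetical file name pattern, for illustration only.
        eia_data_file_hist = f"eia860_{year}_hist.xlsx"
    return eia_data_file_hist


def describe_inputs(use_hist: bool, year: int) -> str:
    """Use the variable outside the branch, handling the 'not set' case explicitly."""
    eia_data_file_hist = pick_hist_file(use_hist, year)
    if eia_data_file_hist is None:
        return "no historical file"
    return f"historical file: {eia_data_file_hist}"
```

Whether a default or an explicit error is the right behavior depends on what the notebook expects when the condition is false, which is a question for RMI.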

Documentation

[ ] Update readme
- [ ] Update methodology section describing new process
- [ ] Update outputs section
[ ] Update markdown blocks
- [ ] Update the "Import EIA data" section
- [ ] Update step 3 of analysis section

Needs from RMI

Many of the above tasks can be handled by Catalyst, but it would be helpful to get some inputs from RMI.

A high level overview of the process changes will make it easier to parse some of the logic and update docs.
Should commented out code be removed entirely?
Will the manual matching process change with multiple years?

arengel commented 1 year ago

Thanks @zschira for taking a look at our changes to the crosswalk and laying out these next steps!

I think these all make sense in terms of getting to a PR. The other part of this is whether a PR makes sense for some or all of the changes. I think a response to your first question would help us have the required conversation with the maintainers of the upstream repo.

A high-level overview of the changes

1. Update data sources: There are broadly two parts to this: (a) bump the year of the Annual 860 data used to the most recent year (and add the ability to use Early Release data), and (b) use Monthly 860 as the source for generator data. I think (a) would be relatively uncontroversial, with the possible exception of using Early Release data. On (b), the logic is two-fold: it allows adding matches for new plants more quickly, and the monthly generator data is more complete for retired generators.
2. Create a multi-year crosswalk: Here we bring in many years of 860 plant characteristic data and join it to an expanded crosswalk. The goal is to link CAMD units to capacity and fuel types by year, since those data can change from year to year. That then allows us to assign each CAMD unit a plant, a prime mover, and one or more fuels. It also supported the allocation of CAMD units to EIA generators.
3. Usability tweaks: Things like caching CAMD data to avoid multiple downloads when making adjustments.
4. Removal of optional data: I'm not sure why this was removed.
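
To make the multi-year crosswalk idea concrete, here is a minimal sketch in Python using only the standard library. It is not the notebook's implementation: the column names, IDs, and values are all made up for illustration, and the real join keys may differ.

```python
from collections import defaultdict

# Hypothetical crosswalk rows linking one CAMD unit to two EIA generators.
crosswalk = [
    {"camd_plant_id": 1, "camd_unit_id": "U1", "eia_plant_id": 1, "eia_generator_id": "G1"},
    {"camd_plant_id": 1, "camd_unit_id": "U1", "eia_plant_id": 1, "eia_generator_id": "G2"},
]

# Hypothetical 860 characteristics keyed by (year, plant, generator).
eia_860 = {
    (2019, 1, "G1"): {"capacity_mw": 100.0, "fuel": "BIT", "prime_mover": "ST"},
    (2020, 1, "G1"): {"capacity_mw": 100.0, "fuel": "NG", "prime_mover": "ST"},
    (2020, 1, "G2"): {"capacity_mw": 50.0, "fuel": "DFO", "prime_mover": "ST"},
}


def build_multi_year_crosswalk(crosswalk, eia_860):
    """Join each crosswalk match to every year of 860 data for its generator."""
    rows = []
    for (year, plant_id, gen_id), traits in sorted(eia_860.items()):
        for match in crosswalk:
            if (match["eia_plant_id"], match["eia_generator_id"]) == (plant_id, gen_id):
                rows.append({**match, "year": year, **traits})
    return rows


def fuels_by_unit_year(rows):
    """Collect the set of fuels associated with each CAMD unit in each year."""
    fuels = defaultdict(set)
    for row in rows:
        fuels[(row["camd_plant_id"], row["camd_unit_id"], row["year"])].add(row["fuel"])
    return dict(fuels)


multi_year = build_multi_year_crosswalk(crosswalk, eia_860)
unit_fuels = fuels_by_unit_year(multi_year)
```

In this toy data, unit U1 picks up a single fuel in 2019 and two fuels in 2020, which is the kind of year-to-year change the multi-year crosswalk is meant to capture.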

Pushing upstream

I think (1) makes a lot of sense for a PR. There are some finer points around the use of Early Release data and the switch to monthly generator data, but I think we can work out something mutually agreeable on both.

I'm more ambivalent about (2). Some of the logic of the multi-year crosswalk has diminished, as we no longer attempt to allocate CAMD units to EIA generators and we have better downstream methods for determining the prime mover and fuel codes to associate with CAMD units. This points to the fact that this functionality is separable from the matching logic of the crosswalk. Unless we believe that others would be interested in it, I'm not sure it's worth the added effort of putting it into a PR. Also, personally, I would rather have it in Python, so I would argue that any version of this we do want should live in PUDL or elsewhere, and that it is maybe better as part of the broader CEMS crosswalk work we've been discussing with Catalyst and OGE.

(3) is minor and fine to include. I would probably not include (4) in the PR. One other category of changes that we should revert is changes to column names.
