Investigate running the EPA-EIA crosswalk on 2021 data

zschira commented 1 year ago

Background

By default, the EPA-EIA crosswalk uses 2018 EIA 860 data to generate matches. The notebook provided by EPA to generate these matches has a configurable option for the year used, but there are a few minor things to consider when running it with 2021 data (the most recent year of 860 data available).

Fix Download URL

The first and simplest is a slight URL change. The links for older years of data contain /archive/ in the path, while the link for the most recent year does not. Removing this segment from the URL makes the download work.
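A minimal sketch of that URL rule in Python, for illustration only. The base URL and file-name pattern are assumptions about the EIA site layout and should be checked against what the notebook actually downloads; `eia860_url` and `latest_year` are hypothetical names, not part of the crosswalk code.

```python
# Assumed URL layout; verify against the EIA site before relying on it.
EIA860_BASE = "https://www.eia.gov/electricity/data/eia860"


def eia860_url(year: int, latest_year: int) -> str:
    """Build the download URL for one year of EIA 860 data (hypothetical helper).

    Older years live under /archive/, the most recent year does not.
    """
    if year > latest_year:
        raise ValueError(f"no EIA 860 release for {year} (latest is {latest_year})")
    archive = "" if year == latest_year else "archive/"
    return f"{EIA860_BASE}/{archive}xls/eia860{year}.zip"


print(eia860_url(2018, latest_year=2021))  # path contains /archive/
print(eia860_url(2021, latest_year=2021))  # path does not
```

Passing `latest_year` explicitly keeps the rule visible: once a newer year is published, the 2021 link would presumably move under /archive/ as well.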

Update Manual Matches

Manual matches are used to supplement and improve the generated matches. Given that these manual matches were mostly found using 2018 data, there are likely improvements/additions that could be found using 2021 data. The readme includes a section about contributing manual matches, so any work developing new matches could be contributed to the main repo.

NEEDS

The crosswalk includes an option for using National Electric Energy Data System (NEEDS) data in the matches. By default this data is not used, but when the option is enabled it also defaults to the 2018 version. Using the 2021 version of this data will also require a minor URL update.

Results

After simply changing the EIA 860 URL, the entire crosswalk runs successfully. Based on comparing the number of matches found using 2021 data with the number found using 2018 data, and spot checking a number of the 2021 matches, the outputs seem reasonable and consistent with those found for 2018.
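One way such a comparison could be scripted is sketched below in Python. The column name `MATCH_TYPE_GEN` is an assumption about the crosswalk output and may need adjusting, the helper names are hypothetical, and the rows are made up purely to show the mechanics; they are not real crosswalk results.

```python
from collections import Counter
import csv
import io


def count_by_match_type(rows, column="MATCH_TYPE_GEN"):
    """Count crosswalk rows per match type; blank values count as 'unmatched'."""
    return Counter((row.get(column) or "unmatched") for row in rows)


def compare_counts(old, new):
    """Return {match_type: (old_count, new_count, difference)} over both runs."""
    return {
        key: (old[key], new[key], new[key] - old[key])
        for key in sorted(set(old) | set(new))
    }


# Made-up rows for illustration only; these are not real crosswalk results.
run_2018 = io.StringIO("CAMD_PLANT_ID,MATCH_TYPE_GEN\n1,typeA\n2,typeA\n3,\n")
run_2021 = io.StringIO("CAMD_PLANT_ID,MATCH_TYPE_GEN\n1,typeA\n2,typeB\n3,typeB\n")

old = count_by_match_type(csv.DictReader(run_2018))
new = count_by_match_type(csv.DictReader(run_2021))
for match_type, (n_old, n_new, diff) in compare_counts(old, new).items():
    print(f"{match_type}: 2018={n_old} 2021={n_new} diff={diff:+d}")
```

With real outputs, the two `io.StringIO` objects would be replaced by open file handles on the 2018 and 2021 crosswalk CSVs.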

zaneselvans commented 1 year ago

We've found over the years that the URLs and filenames are not particularly stable (even once you account for the archive / non-archive links), which is why we built the PUDL datastore. For reproducible analyses, using those archived inputs might be helpful (another good use case for splitting the datastore out from the rest of the PUDL data pipeline infrastructure).

aesharpe commented 1 year ago

This is thrillingly simple! We should talk about whether / how we might want to archive this 2021 version in Zenodo.

zschira commented 1 year ago

I think the question for archiving is what RMI's use case for the crosswalk outputs will be. The easiest solution would be to just manually run the crosswalk with 2021 data, then stick the outputs somewhere. This could just be in a git repo, or we could put it on Zenodo. If we wanted something a little more reproducible that would allow us to programmatically recreate the 2021 outputs, we could probably come up with some way to make that work. Do we have any more info on how/where RMI plans to use this?

aesharpe commented 1 year ago

@arengel what do you think about @zschira's proposal?

arengel commented 1 year ago

Our main requirement is that we can access it programmatically from Python. That could be having it accessible in a git repo (or Zenodo, or elsewhere) at some fixed URL, available in pudl.sqlite / PudlTabl, or created dynamically by code that could be called from Python.
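As a rough sketch of the fixed-URL option, something like the following would be enough on our side. It uses only the standard library; `load_crosswalk` is a hypothetical helper and the URL in the comment is a placeholder, since where the outputs will live is still undecided.

```python
import csv
import io
import urllib.request


def load_crosswalk(url: str) -> list[dict[str, str]]:
    """Fetch a CSV from a fixed URL and return its rows as dictionaries."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Placeholder location; the real URL depends on where the outputs end up.
# rows = load_crosswalk("https://example.org/epa_eia_crosswalk_2021.csv")
```

The same function also accepts a `file://` URL, so it can be pointed at a local copy while the hosting question is being settled.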

Ultimately we'd like to be able to pull the final PUDL crosswalk (with subplant_id from #2491) that uses the most recent EIA and CEMS data from pudl.sqlite or wherever such things end up getting stored. So it's maybe as much a question of how it should be created and then stored as part of, or as an input to, your ETL as it is a question of where we want it stored.

zschira commented 1 year ago

It seems like a good start might be to manually run the crosswalk and store the outputs somewhere they can easily be accessed. Then we could look at setting up a little bit of infrastructure to create an archive in a more programmatic and reproducible way. I'll attend today's planning meeting so we can discuss some possible options.

zschira commented 1 year ago

I'm going to go ahead and close this issue, and track the archiving in development #2505.

catalyst-cooperative / pudl