catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Assign PUDL IDs to EIA 860/923 old entities (2001-2008) and map them to FERC entities #1069

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

Integration of the EIA 923 data from 2001-2008 (see PR #1035) has added some new EIA plants and utilities that previously didn't appear, and so they are now showing up as unmapped in the PUDL plant / utility ID tables. In total there appear to be:

These plants and utilities need to be added to the manual PUDL ID mapping, and associated with the corresponding FERC plants and utilities which previously could not be mapped.

There will very likely be additional unmapped plants and utilities which appear as a result of the 2001-2003 EIA 860 data being integrated as well. It probably makes sense to do all the mapping for 2001-2008 at the same time. See also #848

zaneselvans commented 2 years ago

@bendnorman since you're working on #848 right now, you might want to look at this as the next thing to work on. We'll need to link up all the FERC/EIA utilities and plants for these older years once you've got them integrated.

bendnorman commented 2 years ago

I reran devtools/ferc1-eia-glue/ferc1-eia-glue.ipynb on EIA 923 and 860 2001-2019. In total there appear to be:

@zaneselvans how did you get 387 unmapped utilities? I double-checked the data on datasette and there are 12790 unique utility_id_eia's and there are only 6776 mapped utilities in mapping_eia923_ferc1.xlsx.

I sent csvs of the unmapped utilities and plants to @stevenbwinter. I think many of the unmapped utilities have very small total capacities.

zaneselvans commented 2 years ago

Please also review / try to fix the PUDL IDs from #736 into this.

bendnorman commented 2 years ago

I think the unmapped csvs slipped through the cracks. I reran the ferc1-eia-glue notebook to produce the unmapped data with the 2001 - 2019 EIA data. We are waiting for the 923 data before we map the 2020 data. The outputs live in this zipfile on google drive.

zaneselvans commented 2 years ago

I've re-run it with the 2020 EIA and FERC 1 data as well and uploaded the CSVs to google drive

TrentonBush commented 2 years ago

I re-ran it with data from 2001-2020 plus 860m data from july 2021. I uploaded to the same google drive folder

zaneselvans commented 2 years ago

@cmgosnell and I sat down on Friday afternoon and went through the notebook to try and weed out extraneous information, simplify the outputs, and make it clearer what should be done with them. I've turned it into a script that generates 4 CSV files, listing unmapped plants/utilities for ferc1/eia. The EIA files have a column indicating whether we should try and link that plant or utility to a FERC 1 plant -- for Utilities, it's only the ones that are associated with plants that reported data in the EIA 923. For the plants, it's only those above a certain capacity threshold (5 MW).

However, all of the plant and utility IDs should make it into the mapping spreadsheet -- it's just that most of them will have automatically generated PUDL IDs that aren't used to link FERC & EIA plants/utilities. This way we can always expect to find 0 unmapped plants and utilities, and structurally they'll all conform to the same relationships in the database.

With those criteria, and our previous snafu with leaving out the 2020 FERC 1 data somehow, we've still got the following to map:

Here are the CSV files

zaneselvans commented 2 years ago

Oh and this is what the help message from the script looks like...

usage: find_unmapped_plants_utils.py [-h]

Identify all unmapped plant and utility IDs in FERC 1 and EIA data.

This script identifies plants and utilities which exist in the FERC 1 and EIA data that
do not yet appear in our manually compiled mapping between the two datasets.

For the FERC 1 plants/utilities it compares the IDs in the spreadsheet to the utility
IDs and plant names that show up in the raw ferc1.sqlite DB.

For the EIA plants/utilities it comapres the IDs in the spreadsheet to the utility and
plant IDs that appear in the PUDL DB.

Before running this script:
===========================
* Load all available years of EIA 860/923 data into your PUDL DB. This includes the new
  year of data to be integrated, so you'll have to do the file, tab, and column mapping
  first, and may need to do a bit of data wrangling to get the transform steps to work.
  You'll probably want to load this data into the PUDL DB without any FERC Form 1 data.
* Load all available years of FERC Form 1 data into your FERC 1 DB. Again, this includes
  the new year of data to be integrated.

If the script sees that all available years of data are not loaded in your databases, it
will raise an AssertionError.

If there is an unexpectedly large number of "lost" EIA plants or utilities (which appear
in the mapping spreadsheet but not the database) it will raise an AssertionError. This
usually indicates that not all of the available data is loaded into the DB.

Script Outputs:
===============
unmapped_utils_ferc1.csv:
    Respondent IDs and respondent names of utilities which appear in the FERC Form 1 DB,
    but which do not appear in the PUDL ID mapping spreadsheet. We should attempt to
    find the corresponding EIA utilities for all of them.

unmapped_plants_ferc1.csv:
    Plant names, respondent names, and respondent IDs associated with plants that appear
    in the FERC Form 1 DB, but which do not appear in the PUDL ID Mapping spreadsheet.
    We should try and find corresponding EIA plants for all of them. This output
    includes large steam plants, as well as small, hydro, and pumped storage plants,
    with the table of origin indicated.

unmapped_utils_eia.csv:
    EIA Utility IDs and names of utilities which appear in the PUDL DB, but which do not
    appear in the PUDL ID mapping spreadsheet. We should only attempt to link EIA
    utilities to their FERC 1 counterparts if they are associated with plants that
    reported data in the EIA 923. These utilities have "True" in the "link_to_ferc1"
    column. All other EIA utility IDs should get added to the mapping spreadsheet with
    an automatically assigned PUDL ID.

unmapped_plants_eia.csv:
    EIA Plant IDs and Plant Names of plants which appear in the PUDL DB, but which do
    not appear in the PUDL ID mapping spreadsheet. The Utility ID and Name for the
    primary plant operator, as well as the aggregate plant capacity and the state the
    plant is located in are also proved to aid in PUDL ID mapping. We don't attempt to
    link all plants to their FERC 1 counterparts, only those with a capacity over some
    minimum threshold. These plants are indicated with "True" in the "link_to_ferc1"
    column. All other EIA PLant IDs should get added to the mapping spreadsheet with an
    automatically assigned PUDL ID.

optional arguments:
  -h, --help  show this help message and exit