USEPA / camd-eia-crosswalk

A data crosswalk to integrate U.S. power sector emission and operation data from EPA to EIA
MIT License
38 stars 12 forks source link
oar

EPA—EIA Power Sector Data Crosswalk

A data crosswalk to integrate U.S. power sector emission and operation data from the Environmental Protection Agency (EPA) and Energy Information Administration (EIA). Provided in this repo are both the outputs of the crosswalk (csv and xlsx format) and the R script (in an R markdown document) that generates these outputs.

Notice: This is a work in progress. EPA's Clean Air Markets Division (CAMD) continues to refine its methodology and quality assurance procedures and will provide information on updates as necessary.

Update (v0.3): As of October 2022, CAMD has added manual matches through a unit-by-unit investigation as well as new datasets, as described below. CAMD has also fixed an issue with the sequence numbers (a column added to maintain the order of the units in Excel), resulting in changes from v0.2.

Background


CAMD and EIA provide two of the most comprehensive and commonly used data sets about the electric power sector. These two data sets include information on emissions, electricity generation, fuel use, operations, and attributes of power plants across the United States. Many researchers and data users find useful information in both data sets. However, key differences in the purpose and manner of data collection by these agencies contribute to difficulties in merging the two data sets. In providing this crosswalk, which relates key identifiers assigned to power plants and units used by both agencies, CAMD is hoping to make it easier to integrate both data sets.

Update (v0.3): In addition, CAMD is providing code to facilitate matching to two other relevant EPA datasets: Facility Registry Service (FRS) identifiers and National Electric Energy Data System (NEEDS) identifiers. Some additional information about these two datasets is provided below.

The crosswalk is available as a spreadsheet (xlsx) and delimited text file (csv) for those who are just looking to integrate the two data sets. The R script for producing the crosswalk from the two data sets is also available for those interested in how it was made or adapting it. Please see "Contributing to the Crosswalk" for more information on how to improve it.

What facilities are included in each data set?

Generally speaking, CAMD collects the Power Sector Emissions Data from fossil fuel-fired electric generating units (EGUs) over 25 MW in nameplate capacity. EIA collects data on all EGUs (including nuclear and renewables) that are located at a plant that is over 1 MW in nameplate capacity and connected to the electricity grid.

Update (v0.3): FRS assigns facility identifiers to facilities that submit data to EPA under any program, including EGUs. NEEDS combines the list of EGUs from both the Power Sector Emissions Data and EIA datasets, while assigning its own identifiers to those EGUs.

How are these data collected?

EGUs that submit data to CAMD generally use continuous emissions monitoring systems (CEMS) and data acquisition and handling system to collect emissions and operations data. EGUs submit the hourly data to CAMD every calendar quarter using EPA-provided data validation and reporting software. EGUs submit data to EIA through various surveys with various reporting periods.

Why are these data collected?

CAMD is empowered to collect the Power Sector Emissions Data to ensure compliance with certain EPA and state air quality regulatory programs that affect the electric power sector. EIA collects a variety of energy data to provide independent analysis to the public.

What is the temporal resolution of each data set?

CAMD collects hourly emissions and operation data at the combustion unit (e.g., boiler) level. Power plants submit these data to EPA every calendar quarter. Generally, EIA provides monthly and annual data at the power plant and/or generating unit levels.

Update (v0.3): NEEDS contains data used for modeling future power sector outcomes in the Integrated Planning Model (IPM). FRS IDs are assigned to facilities that are a part of at least one of several national and state information/data/regulatory programs. FRS IDs associated with EGUs are unlikely to change, as their underlying Plant IDs are unlikely to change. New or retired EGUs would be added or removed from the FRS in accordance with updates to the CAMD or EIA databases.

What is the lowest level of spatial aggregation for each data set?

CAMD and EIA both collect unit-level data. However, there is often confusion about what CAMD considers a "unit" versus what EIA considers a "unit."

In general, a steam power plant has a boiler that combusts fuel used to produce steam, the energy of the steam is used to rotate a steam turbine and a generator that converts the kinetic energy of rotation into electrical energy (refer to Figure 1.A below). Because CAMD's purpose for collecting data is environmental compliance, CAMD defines a unit as the source of emissions, the combustion unit (e.g., boiler). On the other hand, EIA is primarily focused on collecting data on electrical generation, so EIA defines a unit as the source of electricity, the generator. While many power plants have a one-to-one relationship between boilers and generators, many do not. As in Figure 1.B below, there can be two (or more) boilers that serve one generator. In CAMD's database, the plant illustrated in Figure 1.B would list two units (two boilers); in EIA's database, this plant would list only one unit (one generator).

Like steam power plants, gas turbines can also have one-to-one relationships (Figure 1.C) or complex configurations. Figure 1.D illustrates a common natural gas combined cycle configuration with multiple gas turbines providing excess heat to a single steam turbine. In CAMD's database, the plant in Figure 1.D would list two units (two gas turbines); in EIA's database, this plant would list three units (two gas turbines/generators and one steam turbine/generator).

Figure 1. Fossil Fuel-Fired Power Plant Diagrams

Figure 1. Fossil Fuel-Fired Power Plant Diagrams

Each power plant is identified by an Office of Regulatory Information Systems Plant Location code (ORISPL or ORIS), a unique plant identifier. With a few exceptions (refer to "Methodology" for more information), the unique IDs are consistent between CAMD and EIA. In the EIA database the ID is called a "Plant ID", while in the EPA/CAMD database it's called an "ORIS code". This ID may also be referred to as an "ORISPL", "ORISPL code", "Facility ID", or "Facility code", among other names. This crosswalk will use the term "Plant ID" for the power plant identifier. The boiler and the generator have separate IDs. These IDs may be the same (for example, the boiler ID in Figure 1.A may be listed as 1 with the associated generator ID also listed as 1), but they may be different, which creates difficulties in joining CAMD and EIA data. In addition, the IDs for the same plant component may differ between the two data sets due to reporting inconsistencies.

For more information on CAMD's Power Sector Emissions Data, see https://www.epa.gov/airmarkets/power-sector-emissions-data.

For more information on EIA's electricity data, see https://www.eia.gov/electricity/data/guide/.

How do I cite the crosswalk data?

Huetteman, Justine; Tafoya, Johnathan; Johnson, Travis; and Schreifels, Jeremy. 2021. EPA-EIA Power Sector Data Crosswalk. Accessible at www.epa.gov/airmarkets/power-sector-data-crosswalk.

Methodology


In general, the crosswalk R script matches CAMD data and EIA data based first on exact matches of unique identifiers followed by matches based on algorithms that detect similar matches of unique identifiers, but may not be exact (i.e., "fuzzy matching"). This second type of matching resolves many of the cases where there are reporting discrepancies between data submitted to CAMD and to EIA.

CAMD data are retrieved using the Field Audit Checklist Tool (FACT) API. EIA data are retrieved from the Form EIA-860. Note some CAMD units do not have an associated generator ID in CAMD's data (refer to "Important Notes" section below).

The R script uses the following tiered approach to match EPA/CAMD data to EIA data at each facility:

  1. The CAMD Plant ID and generator ID are matched to the EIA Plant ID and generator ID (from EIA-860 3_1_Generator_Y{data year}) according to these steps:

    i. Matches and exclusions from the manual_matches.xlsx file are pulled in before any matching occurs to prevent duplicates

    ii. Exact matching between both data sets

    iii."Fuzzy" matching between both data sets

  2. The CAMD Plant ID, generator ID, and unit ID are matched to the EIA Plant ID, generator ID, and boiler ID (from EIA-860 6_1_EnviroAssoc_Y{data year}) according to these steps:

    i. Matches and exclusions from the manual_matches.xlsx file are pulled in before any matching occurs to prevent duplicates

    ii. Exact matching between both data sets

    iii. "Fuzzy" matching between both data sets

  3. The results from Steps 1 and 2 are joined, resulting in a set of comprehensive matches that have all CAMD identifiers and all EIA identifiers where they exist. CAMD units that did not match in any step are added to the crosswalk with an indicator that they were unmatched.

Manual matches between CAMD data and EIA data that would not be captured via exact matching or fuzzy matches, are added to the crosswalk from the manual_matches.xlsx file. Any CAMD units in the manual match file that should be excluded from the matching process, mostly due to the lack of a connection to the electricity grid (e.g., industrial boilers), are added to the crosswalk with an indicator that they were manually excluded.

Due to the complex one-to-many, many-to-one, and many-to-many relationships between units/boilers and generators, EIA boiler ID data must be matched to CAMD data in a separate step. In addition, as stated previously, sometimes the units/boilers and generators, though connected, do not have the same unique identifier, which would preclude cross-matching based on CAMD generator ID and EIA boiler ID.

Within each matching step, there are sub-steps that perform the exact and "fuzzy" matching processes. The first sub-step is an exact match on the CAMD and EIA identifiers. The next sub-steps each perform an operation on the identifiers to modify each of them to a common state. This modification is performed on the original identifier and the result is placed in a field of the same name preceded by "MOD_" (e.g. CAMD_UNIT_ID and EIA_BOILER_ID are each modified in Step 2b to remove special characters and whitespaces. The result of the modification is placed in the MOD_CAMD_UNIT_ID and MOD_EIA_BOILER_ID fields, respectively).

In rare instances, the Plant IDs do not match between CAMD and EIA's databases. These discrepancies were discovered through the production of eGRID and are regularly tracked and updated with new eGRID releases. In the crosswalk R script, these discrepancies are accounted for before any matching occurs–EIA's plant ID is modified ("MOD_EIA_PLANT_ID") to match CAMD's plant ID and includes a field indicating that the plant ID has been changed ("PLANT_ID_CHANGE_FLAG").

Other discrepancies or missing data may exist and are manually investigated and added to the manual_matches.xlsx. In the R script, these matches are done first. This helps remove duplicates and improve accuracy. If other manual matches are found, they will be added to this file, and anyone can create a pull request adding matches and a CAMD staff member will review the pull request to incorporate the new matches.

Update (v0.3): Once the matching steps have been made between CAMD and EIA data, steps to add FRS IDs and NEEDS IDs will run in the code. These steps will run only if the user has set the options for additional datasets to be included towards the beginning of the code (i.e., “include_FRS” and/or “include_NEEDS” set to TRUE).

To add FRS IDs, the code calls the FRS ID REST API. These IDs are joined to CAMD data via CAMD_PLANT_ID. These IDs are at the facility level, so all subcomponents (generators and boilers) at one facility will be assigned the same FRS ID.

To add NEEDS IDs, the code imports the NEEDS flat file from November 2018, which is available in the data file in this repo. NEEDS IDs (“NEEDS_UNIQUE_ID”) are a concatenated version of the EIA Plant ID, EIA Generator/Boiler ID, and a letter indicating whether the unit is a boiler (“B”) or a generator (“G”). Generators from NEEDS are matched on EIA_PLANT_ID and EIA_GENERATOR_ID. If the unit has an associated EIA_BOILER_ID, boilers from NEEDS are matched on EIA_PLANT_ID and EIA_BOILER_ID.

For more information on CAMD's FACT API and to sign up for an API key, see https://www.epa.gov/airmarkets/field-audit-checklist-tool-fact-api#/.

For more information on the EIA-860, see https://www.eia.gov/electricity/data/eia860/.

Output

The resulting crosswalk includes an xlsx spreadsheet (epa_eia_crosswalk.xlsx) and csv file (epa_eia_crosswalk.csv), which list all boilers and generators in CAMD's database with their corresponding EIA boiler and generator if they have a match. Unmatched CAMD units are included with a "CAMD Unmatched" label in the "MATCH_TYPE" columns. Generators listed in EIA data that are not matched to CAMD data are omitted. The fields "MATCH_TYPE_GEN" and "MATCH_TYPE_BOILER" indicate how the generators and boilers were matched. The fields included in the final crosswalk are listed and described in the table below.

Column Name Description
SEQUENCE NUMBER Row number assigned to each observation. Included for purposes of sorting to original order.
CAMD_STATE The state where the facility is located in CAMD's data.
CAMD_FACILITY_NAME The name of the facility in CAMD's data.
CAMD_PLANT_ID The unique ID (also known as ORIS code/ORISPL code) of the facility in CAMD's data. (CAMD key)
CAMD_UNIT_ID The unique ID for a combustion unit at a facility in CAMD's data. (CAMD key)
CAMD_GENERATOR_ID The unique ID for a generator at a facility in CAMD's data. (CAMD key)
CAMD_NAMEPLATE_CAPACITY The maximum rated output of the generator in CAMD's data, prime mover, or other electric power production equipment under specific conditions designated by the manufacturer. Expressed in MW.
CAMD_FUEL_TYPE The primary fuel type for the unit in CAMD's data.
CAMD_LATITUDE The latitude of the facility in CAMD's data.
CAMD_LONGITUDE The longitude of the facility in CAMD's data.
CAMD_STATUS The status of the unit in CAMD's data. Either OPR (Operating), LTCS (Long-term cold storage), or RET (Retired).
CAMD_STATUS_DATE The date on which the status of the unit last changed in CAMD's data. For example, the date on which the unit changed from operating to retired.
CAMD_RETIRE_YEAR The year in which the unit retired in CAMD's data.
MOD_CAMD_UNIT_ID The resulting CAMD unit ID used to create a match between CAMD and EIA. It may be the same as the original if it was matched without modifications.
MOD_CAMD_GENERATOR_ID The resulting CAMD generator ID used to create a match between CAMD and EIA. It may be the same as the original if it was matched without modifications.
EIA_STATE The state where the facility is located in EIA's data.
EIA_PLANT_NAME The name of the facility in EIA's data.
EIA_PLANT_ID The unique ID of the facility in EIA's data. (EIA key)
EIA_GENERATOR_ID The unique ID for a generator at a facility in EIA's data. (EIA key)
EIA_NAMEPLATE_CAPACITY The highest value on the generator nameplate in megawatts rounded to the nearest tenth. Expressed in MW.
EIA_BOILER_ID The unique ID for a steam boiler unit at a facility in EIA's data. (EIA key)
EIA_UNIT_TYPE The prime mover (devices—e.g., gas turbine, steam turbine—that convert fuels to electrical energy via a generator) for an EIA unit. See reference table 2 in EIA-60 LayoutY{data_year}.xlsx or https://www.epa.gov/sites/production/files/2017-01/egrid_code_lookup.xlsx
EIA_FUEL_TYPE The primary fuel type for the unit in EIA's data.
EIA_LATITUDE The latitude of the facility in EIA's data.
EIA_LONGITUDE The longitude of the facility in EIA's data.
EIA_RETIRE_YEAR The year in which the unit retired in EIA's data.
PLANT_ID_CHANGE_FLAG A flag to indicate whether the EIA Plant ID was changed to match a CAMD ORIS Code according to a list of known discrepancies between CAMD ORIS code and EIA Plant ID. The list is linked in the README.
MOD_EIA_PLANT_ID The resulting EIA plant ID used to create a match between CAMD and EIA before any matching occurs. See "plant_id_manual_matches" sheet within the manual matches file.
MOD_EIA_BOILER_ID The resulting EIA boiler ID used to create a match between CAMD and EIA during Step 2. It may be the same as the original if it was matched without modifications.
MOD_EIA_GENERATOR_ID_BOILER The resulting EIA generator ID used to create a match between CAMD and EIA during Step 2. It may be the same as the original if it was matched without modifications.
MOD_EIA_GENERATOR_ID_GEN The resulting EIA generator ID used to create a match between CAMD and EIA during Step 1, matching CAMD generators to EIA generators on CAMD and EIA plant and generator IDs. It may be the same as the original if it was matched without modifications.
MATCH_TYPE_GEN The type of match made during Step 1, matching CAMD generators to EIA generators on CAMD and EIA plant and generator IDs. Any applied modifier sub-steps are also indicated in this field.
MATCH_TYPE_BOILER The type of match made during Step 2, matching CAMD units and generators to EIA boilers and generators on CAMD and EIA plant, unit/boiler, and generator IDs. Any applied modifier sub-steps are also indicated in this field.

Update (v0.3): Columns for additional datasets

  • FRS ID: The FRS ID associated with each observation.
  • NEEDS_UNIQUE_ID: The NEEDS ID associated with each observation.

Important Notes


Contributing to the Crosswalk


Thanks for taking the time to contribute! You can help improve the crosswalk by identifying mismatched units, adding new matches, and contributing updates to the R script.

Getting Started

The data for this crosswalk can be found from these two sources:

The crosswalk script is built using R with tidyverse packages and styling.

Additions to Manual Matches

If you investigate a source that is currently unmatched and find information that leads to a match, please fork the repository and add the match details to the manual_matches.xlsx file with a reason and sources that validate the match. A CAMD staff member will review the pull-request and if it is a valid match, it will be included in the next release of the crosswalk. Alternatively, you can open a new issue with information about the new match (see the Issues section below). A CAMD staff member will review the pull-request and if it is a valid match, it will be included in the next release of the crosswalk.

When adding a new match, make sure to:

Issues

Ensure the issue was not already reported by searching on Github under issues. If you're unable to find an open issue addressing the bug, open a new issue.

When writing an issue please write detailed information to help us understand the issue.

For example:

Pull Requests

Pull requests are always welcome!

Examples

To assist users in understanding how to implement the crosswalk, CAMD included a short sample analysis (sample_analysis.Rmd) that employs the crosswalk to connect CAMD and EIA data. In this sample analysis, annual nitrogen oxide (NOX) emission rates are generated for coal-fired EGUs (specifically, generators) in Alabama based on CAMD data on NOX emissions and EIA data on net generation for the year 2018. The result is a plot comparing NOX emission rates with electricity output.

Disclaimer

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.