Foundational Industry Energy Dataset (FIED)

Summary

This is an effort by the National Renewable Energy Laboratory (NREL) and Argonne National Laboratory (ANL) to create an experimental foundational industry dataset for energy and emissions analysis and modeling. The code draws on publicly available data, primarily from the U.S. EPA, to compile a data set of unit-level energy use and unit characteristics for U.S. industrial facilities in 2017.

The FIED, and the accompanying technical report, can be downloaded from its Open Energy Data Initiative submission.

Getting Started

Manual Data Downloads

Because of how they are provided, several data sets must be downloaded manually before the code can run successfully. These data sets and their directory locations are:

Source Classification Codes (SCCs)
- Download from https://sor-scc-api.epa.gov/sccwebservices/sccsearch/
- Save to data/SCC/SCCDownload.csv
2017 National Emissions Inventory (NEI)
- Download from https://gaftp.epa.gov/air/nei/2017/data_summaries/2017v1/2017neiJan_facility_process_byregions.zip
- Save and unzip data to data/NEI/.
- nei_EF_calculations.py will format and combine the unzipped csv files into nei_ind_data.csv
GHGRP Emissions by Unit and Fuel Type
- Download from https://www.epa.gov/system/files/other-files/2022-10/emissions_by_unit_and_fuel_type_c_d_aa_10_2022.zip
- Save to data/GHGRP/
- ghgrp_fac_unit.py will unzip and format these data.
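
Before running the code, it can help to confirm that the manual downloads are in place. The following is a minimal sketch, not part of the repository: the function name missing_downloads and the EXPECTED mapping are hypothetical, and the paths are the ones listed above. It assumes the NEI and GHGRP directories count as populated if they contain any file at all, since the exact file names depend on the downloaded archives.

```python
from pathlib import Path

# Expected locations, taken from the download instructions above.
# NEI and GHGRP are directories; their contents depend on the archives.
EXPECTED = {
    "SCC": "data/SCC/SCCDownload.csv",
    "NEI": "data/NEI",
    "GHGRP": "data/GHGRP",
}


def missing_downloads(repo_root="."):
    """Return the names of data sets whose expected path is absent or empty."""
    root = Path(repo_root)
    missing = []
    for name, rel in EXPECTED.items():
        path = root / rel
        if path.is_dir():
            # A directory only counts if something has been saved into it.
            if not any(path.iterdir()):
                missing.append(name)
        elif not path.is_file():
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_downloads():
        print(f"Missing manual download: {name} (expected at {EXPECTED[name]})")
```

Run from the repository root, the script prints one line per missing data set and nothing when all three are present.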

Environment

fied_environment.yml defines the conda environment used when creating the foundational dataset. Its key dependencies include:

python=3.9.18=h6244533_0
pandas=1.2.0=py39h2e25243_1
numpy=1.23.4=py39hbccbffa_1
geopandas=0.12.1=pyhd8ed1ab_1
openpyxl=3.0.10=py39h2bbff1b_0

Compiling the FIED

In addition to manually downloading the above datasets, executing the calculations and data compilation requires two steps after activating the fied environment:

1. Run ./frs/frs_extraction.py. This will download, extract, and format EPA FRS data. The resulting csv should be saved in data/FRS/.
2. Run fied_compilation.py. This will execute all of the remaining steps for compiling the foundational data set.

So, from the terminal or Anaconda prompt:

conda activate fied

python ./frs/frs_extraction.py

python fied_compilation.py
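
The same two steps can also be driven from a single Python script, which stops if the first step fails. This is a hedged sketch rather than anything shipped with the repository: run_steps and STEPS are hypothetical names, and it assumes the script is started from the repository root with the fied environment already active.

```python
import subprocess
import sys

# The two compilation steps, in the order given above.
STEPS = ["./frs/frs_extraction.py", "fied_compilation.py"]


def run_steps(steps=STEPS, runner=subprocess.run):
    """Run each script with the current interpreter, stopping at the first failure."""
    for script in steps:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so the compilation step is not started if the FRS step fails.
        runner([sys.executable, script], check=True)


if __name__ == "__main__":
    run_steps()
```

Using sys.executable keeps both steps on the interpreter of the active environment rather than whichever python happens to be first on the PATH.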

Directory Navigation

The underlying submodules and data are organized as follows:

analysis: Methods for analyzing and generating figures of the final dataset.
data: Most folders are created locally for organizing raw data. Contains a directory list.
energy: Not currently used. For future estimation of facility energy use based on alternative approaches.
frs: Methods for downloading and formatting EPA Facility Registry Service data.
geocoder: Methods for collecting missing geographical information for facilities.
ghgrp: Methods for estimating energy use from GHG emissions reported under EPA's Greenhouse Gas Reporting Program. Based on previous projects, such as the Industry Energy Data Book.
nei: Methods for downloading and formatting data from EPA's National Emissions Inventory and for using these data to characterize combustion units.
qpc: Methods for downloading and formatting operating hours reported under the Census Bureau's Quarterly Survey of Plant Capacity Utilization.
scc: Methods to download and apply EPA's Source Classification Codes for characterizing units.
tests: Testing. Currently very limited.
tools: Methods that act as various tools used across submodules.

Overview of FIED Data Fields

Data fields are compiled and described in FIED_datafields.yml. All facilities in the data set are represented by their unique registryID, which is their EPA Facility Registry Service ID.

Many of these data fields were included in original EPA data sources. See the FRS data dictionary for more information.

Identity

In addition to registryID, other identifying fields include:

eisFacilityID: EPA ID assigned to facilities reporting to the Emissions Inventory System (EIS).
ghgrpID: EPA ID assigned to facilities reporting under the Greenhouse Gas Reporting Program (GHGRP).
name: Name of facility.
locationDescription: Description of the facility location.
naicsCode: The facility's North American Industry Classification System (NAICS) code.
naicsCodeAdditional: A facility may have additional NAICS codes assigned (e.g., different reporting systems may have different NAICS assigned).

Geography

Various levels of geographic identifiers are included, such as:

geoID: see Census description of geographic identifiers (GEOIDs)
latitude: facility latitude.
longitude: facility longitude.
postalCode: facility U.S. ZIP code.
countyNAME: name of facility county.
countyFIPS: Federal Information Processing Series (FIPS) code of the facility county.
stateName: name of facility state.
legislativeDisctrictNumber: facility congressional district number.
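
County FIPS codes are five digits (two for the state, three for the county), and leading zeros are easily lost when a csv column is read as a number. The sketch below shows one way to restore them. It is an illustration under assumptions: the helper names are hypothetical, and it assumes countyFIPS holds the combined state-plus-county code, which should be checked against FIED_datafields.yml.

```python
def normalize_county_fips(value):
    """Return a 5-character county FIPS string, or None if the value is unusable."""
    if value is None:
        return None
    text = str(value).strip()
    # Values that were read as floats (e.g. "1001.0") keep only the integer part.
    if text.endswith(".0"):
        text = text[:-2]
    if not text.isdigit() or len(text) > 5:
        return None
    # Restore leading zeros dropped by numeric parsing.
    return text.zfill(5)


def state_fips(county_fips):
    """The first two digits of a 5-digit county FIPS code identify the state."""
    return county_fips[:2]


print(normalize_county_fips(1001))              # Autauga County, Alabama
print(state_fips(normalize_county_fips(48201))) # Harris County, Texas
```

Reading the column as text in the first place avoids the problem entirely; the helper is mainly useful for data that has already been parsed.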

Units and Processes

Individual units are characterized (e.g., unit type, capacity, energy, throughput) where possible. Individual units may be associated with multiple processes.

designCapacity: design capacity of unit.
eisUnitID: U.S. EPA Emissions Inventory System (EIS) unit ID.
unitName: unit name.
unitType: reported or inferred unit type.
unitTypeStd: standardized unitType.
processDescription: description of process. Processes may have more than one unit associated with them.
eisProcessID: U.S. EPA Emissions Inventory System (EIS) process ID. Processes may have more than one unit associated with them.
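
Because units and processes can be related many-to-many, it can be useful to index the data in both directions. The following is a minimal sketch using only the standard library. The sample rows and IDs are made up, link_units_and_processes is a hypothetical helper, and the code assumes one row per unit-process pair; keys include registryID in case EIS IDs are not unique across facilities.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration only; the IDs are made up.
SAMPLE = """registryID,eisUnitID,eisProcessID
110000000001,U1,P1
110000000001,U1,P2
110000000001,U2,P2
"""


def link_units_and_processes(rows):
    """Build unit -> processes and process -> units lookups, keyed per facility."""
    units_to_processes = defaultdict(set)
    processes_to_units = defaultdict(set)
    for row in rows:
        facility = row["registryID"]
        units_to_processes[(facility, row["eisUnitID"])].add(row["eisProcessID"])
        processes_to_units[(facility, row["eisProcessID"])].add(row["eisUnitID"])
    return units_to_processes, processes_to_units


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
u2p, p2u = link_units_and_processes(rows)
print(sorted(u2p[("110000000001", "U1")]))  # processes linked to unit U1
print(sorted(p2u[("110000000001", "P2")]))  # units linked to process P2
```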

Energy

Depending on the estimation approach, a unit may have a single estimate of energy use, or a range of energy estimates (i.e., minimum, median, upper quartile). Energy estimates based on the NEI are presented as a range.

energyMJ: energy estimate in MJ
energyMJq0: minimum of energy estimate, in MJ.
energyMJq2: median of energy estimate, in MJ.
energyMJq3: upper quartile of energy estimate, in MJ.
fuelType: combusted fuel type as reported by original data source.
fuelTypeStd: combusted fuel type, standardized.
energyEstimateSource: source of underlying data used to make energy estimate. Some energy values are provided directly by GHGRP data.
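
Since a unit carries either a single estimate or a range, any aggregation has to decide which value to use for each row. The sketch below shows one possible convention: take energyMJ where it is present and otherwise fall back to the median, energyMJq2. The sample rows, the fuel type labels, and the helper names are hypothetical, and the code assumes missing values appear as empty strings in the csv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration only; values and fuel labels are made up.
SAMPLE = """registryID,fuelTypeStd,energyMJ,energyMJq2,energyMJq3
110000000001,naturalGas,1000,,
110000000001,naturalGas,,200,350
110000000002,coal,,500,800
110000000002,coal,,,
"""


def central_energy_mj(row):
    """Prefer the single estimate; fall back to the median of a range."""
    for field in ("energyMJ", "energyMJq2"):
        value = (row.get(field) or "").strip()
        if value:
            return float(value)
    # Neither a point estimate nor a median is available for this row.
    return None


def energy_by_fuel(rows):
    """Sum the central energy estimate (MJ) for each standardized fuel type."""
    totals = defaultdict(float)
    for row in rows:
        energy = central_energy_mj(row)
        if energy is not None:
            totals[row["fuelTypeStd"]] += energy
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(energy_by_fuel(rows))
```

Summing medians gives only a rough central figure, because the sum of medians is not in general the median of the sum. The minimum and upper quartile fields remain the better guide to the uncertainty of any one unit.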

Greenhouse Gas (GHG) Emissions

ghgsTonneCO2e: GHG emissions estimate (or reported data) in metric tonnes CO2 equivalents.
ghgsTonneCO2eQ0: minimum of GHG emissions estimate in metric tonnes CO2 equivalents.
ghgsTonneCO2eQ2: median GHG emissions estimate in metric tonnes CO2 equivalents.
ghgsTonneCO2eQ3: upper quartile of GHG emissions estimate in metric tonnes CO2 equivalents.
ghgsEstimateSource: source of underlying data used to make GHG emissions estimate. GHGRP emissions data are used directly, as are some NEI data.

Other

We've attempted to include additional descriptive fields where possible. These tend to be sparsely populated at this time.

hucCode8: Hydrologic Unit Code. Not currently implemented.
weeklyOpHours: Average weekly operating hours by quarter, including 95% confidence interval ranges.
sensitiveInd: Indicates whether or not the associated data is enforcement sensitive.
envJusticeCode: The code that identifies the type of environmental justice concern affecting the facility or enforcement action.
smallBusInd: Code indicating whether or not a business is requesting relief under EPA’s Small Business Policy, which applies to businesses having fewer than 100 employees.
througputTonne: Estimated mass throughput.
