NREL / foundational-industry-energy-data

The Foundational Industry Energy Dataset (FIED) is a unit-level characterization of energy use in the U.S. industrial sector.
https://nrel.github.io/foundational-industry-energy-data/

NAICS code assignment #15

Open calmc opened 1 week ago

calmc commented 1 week ago

Is the feature related to a problem? Please describe.

A single facility (unique Registry ID) may be assigned multiple NAICS codes across different EPA information systems, and these codes may fall in completely different sectors (e.g., agriculture vs. manufacturing vs. retail trade). The current approach in frs_extraction.py is naive: it uses the first code reported as the value of naicsCode and keeps all additional codes as the value of naicsCodeAdditional. See the format_naics_csv method: https://github.com/NREL/foundational-industry-energy-data/blob/2c20c9f347d13e8c5c18e10f93f71fc6bcb4060c/fied/frs/frs_extraction.py#L291
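To make the behavior described above concrete, here is a minimal sketch. It is not the code in format_naics_csv; the function name `first_reported`, the list-of-strings input, and the example codes are assumptions for illustration. Only the keys naicsCode and naicsCodeAdditional come from the description.

```python
def first_reported(codes):
    """Mimic the described behavior: the first code wins, the rest are set aside.

    `codes` is a hypothetical list of NAICS code strings in reported order.
    """
    if not codes:
        return {"naicsCode": None, "naicsCodeAdditional": []}
    return {"naicsCode": codes[0], "naicsCodeAdditional": list(codes[1:])}


# Illustrative codes only (not taken from any real facility record):
# one agriculture code happens to be reported ahead of manufacturing codes.
result = first_reported(["111998", "325199", "325199", "325211"])
print(result["naicsCode"])            # 111998
print(result["naicsCodeAdditional"])  # ['325199', '325199', '325211']
```

With this input the agriculture code is selected purely because of its position, which is the failure mode this issue is about.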

Describe the proposed solution

A solution should take a more informed approach to assigning NAICS codes when a facility has different values across information systems. It should account for outlier assignments, as in the example below, where a single agriculture NAICS code is listed alongside many manufacturing NAICS codes.
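One possible way to treat outlier assignments, sketched here under assumptions, is to group codes by NAICS sector and keep only the codes in the most frequently reported sector. The helper names `sector` and `drop_sector_outliers` are hypothetical, the example codes are illustrative, and the tie rule (keep everything when no sector dominates) is a design choice, not something the issue specifies.

```python
from collections import Counter

# NAICS sectors that span more than one two-digit prefix.
SECTOR_OF_PREFIX = {
    "31": "31-33", "32": "31-33", "33": "31-33",  # Manufacturing
    "44": "44-45", "45": "44-45",                 # Retail Trade
    "48": "48-49", "49": "48-49",                 # Transportation and Warehousing
}


def sector(code):
    """Return the sector label for a NAICS code (two-digit prefix or a range)."""
    prefix = str(code)[:2]
    return SECTOR_OF_PREFIX.get(prefix, prefix)


def drop_sector_outliers(codes):
    """Keep only the codes that fall in the most frequently reported sector.

    If two or more sectors tie for the top count, nothing is dropped,
    because there is no clear outlier to remove.
    """
    counts = Counter(sector(c) for c in codes)
    if not counts:
        return []
    top = max(counts.values())
    dominant = [s for s, n in counts.items() if n == top]
    if len(dominant) > 1:
        return list(codes)
    return [c for c in codes if sector(c) == dominant[0]]


# Illustrative codes only: one agriculture code among manufacturing codes.
kept = drop_sector_outliers(["111998", "325199", "325211", "331110"])
print(kept)  # ['325199', '325211', '331110']
```

A rule like this only narrows the candidates; choosing a single code from what remains still needs one of the alternatives discussed in this issue.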

It's not clear at this point what the best solution is. Each of the alternatives identified below should be explored. A test should be written to compare the results of the proposed solutions with each other and with the originally assigned NAICS code, highlighting where they agree and where they do not.
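As a starting point for such a comparison, the sketch below flags, per facility, whether all methods select the same code. The function name `compare_assignments`, the method labels, the placeholder registry IDs, and the codes are all hypothetical; a real test would feed in the output of the existing code and of each candidate method.

```python
def compare_assignments(assignments):
    """Build one row per facility showing each method's code and whether they agree.

    `assignments` maps a registry ID to {method_name: naics_code}.
    """
    rows = []
    for registry_id, by_method in assignments.items():
        distinct = set(by_method.values())
        rows.append(
            {"registry_id": registry_id, **by_method, "agree": len(distinct) == 1}
        )
    return rows


# Placeholder IDs and illustrative codes, not real FRS records.
example = {
    "facility_A": {"original": "111998", "priority": "325199", "prevalent": "325199"},
    "facility_B": {"original": "325211", "priority": "325211", "prevalent": "325211"},
}
rows = compare_assignments(example)
disagreements = [r["registry_id"] for r in rows if not r["agree"]]
print(disagreements)  # ['facility_A']
```

The rows are plain dictionaries, so they can be loaded into a table or asserted on directly in a test.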

Describe alternatives considered

  1. Prefer the NAICS codes from a specific EPA information system? For example, since energy estimates are derived from the NEI (EIS) and GHGRP information systems, should their NAICS codes be preferred over NAICS codes from other information systems (TRIS, RCRAINFO, etc.)? If so, should NEI (EIS) take precedence over GHGRP, or GHGRP over NEI (EIS)?
  2. Use the most prevalent NAICS code? For example, calculate counts of each NAICS code and select the one with the largest count.
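The two alternatives above can be sketched as follows. This is a sketch under assumptions, not a proposed implementation: the record shape (a list of (system, code) pairs in reported order), the function names, the system labels, and the example codes are hypothetical, and the default priority order is arbitrary because which system should rank first is still an open question.

```python
from collections import Counter

# Assumed default order; whether NEI (EIS) or GHGRP should rank first is undecided.
SYSTEM_PRIORITY = ("EIS", "GHGRP")


def pick_by_system(records, priority=SYSTEM_PRIORITY):
    """Alternative 1: return the first code reported by the highest-priority system.

    Falls back to the first reported code when no preferred system is present.
    """
    for system in priority:
        for rec_system, code in records:
            if rec_system == system:
                return code
    return records[0][1] if records else None


def pick_most_prevalent(records):
    """Alternative 2: return the code with the largest count.

    Ties go to the code encountered first, since Counter.most_common
    keeps first-encountered order for equal counts.
    """
    counts = Counter(code for _, code in records)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Illustrative records only, not a real FRS query result.
records = [
    ("TRIS", "111998"),
    ("RCRAINFO", "325199"),
    ("EIS", "325199"),
    ("GHGRP", "325211"),
]
print(pick_by_system(records))                    # 325199
print(pick_by_system(records, ("GHGRP", "EIS")))  # 325211
print(pick_most_prevalent(records))               # 325199
```

Passing the priority as a parameter makes it easy to run both orderings and compare the results.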

Additional context

Here's an example of the FRS Detailed Facility Information query for Registry ID == 110000408274: image

A more complicated example is Registry ID == 110001413239: image