catalyst-cooperative / rmi-ferc1-eia

A collaboration with RMI to integrate FERC Form 1 and EIA CapEx and OpEx reporting
MIT License

WRONG-REPO / DELETED Figure out how/when to integrate the EPA CAMD - EIA crosswalk #256

Closed aesharpe closed 1 year ago

aesharpe commented 1 year ago

Right now the EPA-EIA crosswalk file is only loaded into the pudl db if the EIA data is also getting loaded into the db. This is because the crosswalk depends on EIA for foreign key validation.

The CEMS data also relies on the crosswalk data for access to accurate plant_id_eia values. The values we previously called plant_id_eia in the CEMS data are actually EPA's estimated ORISPL codes. The crosswalk connects these plant-level estimates to the actual EIA codes via a plant_id_epa, unit_id_epa to plant_id_eia map. Most of the plant IDs are identical across EPA and EIA, but a few are not.
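To make the mapping concrete, here is a minimal sketch of the lookup the crosswalk provides, using plain Python rather than the actual pudl tables. The `CROSSWALK` dict, the `lookup_plant_id_eia` helper, and all ID values are hypothetical and made up for illustration; only the column names come from the description above.

```python
# Hypothetical crosswalk rows: (plant_id_epa, unit_id_epa) -> plant_id_eia.
# The ID values are invented for illustration, not real plant codes.
CROSSWALK = {
    (3, "1"): 3,            # IDs agree, as they do for most plants
    (880100, "A"): 55001,   # IDs differ between EPA and EIA
}


def lookup_plant_id_eia(plant_id_epa, unit_id_epa, crosswalk=CROSSWALK):
    """Return the EIA plant ID for an EPA plant/unit pair, or None if unmapped."""
    return crosswalk.get((plant_id_epa, unit_id_epa))


# Hypothetical CEMS records keyed by the EPA identifiers.
cems_rows = [
    {"plant_id_epa": 3, "unit_id_epa": "1"},
    {"plant_id_epa": 880100, "unit_id_epa": "A"},
    {"plant_id_epa": 999999, "unit_id_epa": "X"},  # not in the crosswalk
]

# Attach plant_id_eia to each record without mutating the originals.
mapped = [
    dict(row, plant_id_eia=lookup_plant_id_eia(row["plant_id_epa"], row["unit_id_epa"]))
    for row in cems_rows
]
```

Leaving unmapped units as `None` (instead of falling back to the EPA ID) is a choice made for this sketch, so that rows without a crosswalk entry stay visible.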

We currently rely on the plant_id_eia field in CEMS to fix some of the date entries. The fix_up_dates() function in the epacems transform module uses the plant_id_eia field to join against another dataframe that has plant_id_eia and timezone fields.

If we want this mapping to be accurate, we should merge the crosswalk into the CEMS data first. Merging the crosswalk with CEMS in the transform step is fine in principle, but it would require users who only want to work with CEMS data to also download the EIA data (because CEMS needs the crosswalk, which in turn needs EIA).
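A rough sketch of why the order matters, again in plain Python and not the real fix_up_dates() implementation: the timezone table is keyed by plant_id_eia, so the EPA ID has to go through the crosswalk before the timezone lookup. The `to_utc` helper, the `op_datetime_local` field, the fixed UTC offsets, and all ID values are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical crosswalk: (plant_id_epa, unit_id_epa) -> plant_id_eia.
CROSSWALK = {(880100, "A"): 55001}

# Hypothetical timezone table keyed by plant_id_eia, simplified here to a
# fixed standard-time UTC offset in hours.
PLANT_UTC_OFFSETS = {55001: -6}


def to_utc(row, crosswalk, offsets):
    """Convert a record's local timestamp to UTC, resolving the EIA ID first.

    Returns None when the unit is missing from the crosswalk or the plant
    has no timezone entry.
    """
    plant_id_eia = crosswalk.get((row["plant_id_epa"], row["unit_id_epa"]))
    if plant_id_eia is None or plant_id_eia not in offsets:
        return None
    local = datetime.strptime(row["op_datetime_local"], "%Y-%m-%d %H:%M")
    tz = timezone(timedelta(hours=offsets[plant_id_eia]))
    return local.replace(tzinfo=tz).astimezone(timezone.utc)


row = {"plant_id_epa": 880100, "unit_id_epa": "A", "op_datetime_local": "2019-01-01 00:00"}

# Crosswalk first, then timezone lookup.
utc_time = to_utc(row, CROSSWALK, PLANT_UTC_OFFSETS)

# Using the EPA ID directly against the EIA-keyed table finds nothing.
direct_lookup = PLANT_UTC_OFFSETS.get(row["plant_id_epa"])
```

In this sketch the direct lookup fails for the plant whose EPA and EIA IDs differ, which is the case the crosswalk merge is meant to handle.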

How do folks feel about this?