catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Assign utility_id_pudl values to utilities with data in EIA 923 #544

Closed zaneselvans closed 4 years ago

zaneselvans commented 4 years ago

For some reason, there are still 865 utility_id_eia values for which there are records in the EIA 923 data, which do not have a utility_id_pudl assigned to them. For completeness, and just in case they map to existing FERC utilities somehow, it seems like we should get them added to the mapping spreadsheet.

Sub-tasks:

The records which have utility_id_eia values that lack utility_id_pudl have the following distribution... like we mapped all of the utilities initially when we did 2011-2015, but since then, we've been missing a chunk of utilities every year, and then there are some from 2009 too. Strangely nothing in 2010 though? Really?

frc_eia923 = pudl_raw.frc_eia923()
gf_eia923 = pudl_raw.gf_eia923()
gen_eia923 = pudl_raw.gen_eia923()
bf_eia923 = pudl_raw.bf_eia923()

missing_utils = set([])
missing_utils = missing_utils.union(set(frc_eia923[frc_eia923.utility_id_pudl.isna()].utility_id_eia.unique()))
missing_utils = missing_utils.union(set(gf_eia923[gf_eia923.utility_id_pudl.isna()].utility_id_eia.unique()))
missing_utils = missing_utils.union(set(bf_eia923[bf_eia923.utility_id_pudl.isna()].utility_id_eia.unique()))
missing_utils = missing_utils.union(set(gen_eia923[gen_eia923.utility_id_pudl.isna()].utility_id_eia.unique()))

bad_dates_frc = frc_eia923[frc_eia923.utility_id_eia.isin(missing_utils)][["report_date", "utility_id_eia"]]
bad_dates_gf = gf_eia923[gf_eia923.utility_id_eia.isin(missing_utils)][["report_date", "utility_id_eia"]]
bad_dates_gen = gen_eia923[gen_eia923.utility_id_eia.isin(missing_utils)][["report_date", "utility_id_eia"]]
bad_dates_bf = bf_eia923[bf_eia923.utility_id_eia.isin(missing_utils)][["report_date", "utility_id_eia"]]

bad_dates = pd.concat([bad_dates_frc, bad_dates_gf, bad_dates_gen, bad_dates_bf])
plt.hist(bad_dates.report_date, bins=121);

image.png

However, looking at records in the EIA 860 plants/generators/ownership tables, where the utility_id_pudl field is null (but by definition there's is a plant or generator present), we get this distribution of dates:

mup1 = plants_eia860[plants_eia860.utility_id_pudl.isna()][["utility_id_eia", "plant_id_eia", "report_date"]]
mup2 = gens_eia860[gens_eia860.utility_id_pudl.isna()][["utility_id_eia", "plant_id_eia", "report_date"]]
mup3 = own_eia860[own_eia860.utility_id_pudl.isna()][["utility_id_eia", "plant_id_eia", "report_date"]]
mup4 = own_eia860[own_eia860.utility_id_pudl.isna()][["owner_utility_id_eia", "plant_id_eia", "report_date"]].rename(columns={"owner_utility_id_eia": "utility_id_eia"})
missing_utils_plants = pd.concat([mup1, mup2, mup3, mup4])

image

zaneselvans commented 4 years ago

Hey @swinter2011 this is ready for you if you want to map / assign IDs to the plants that have data.

zaneselvans commented 4 years ago

@cmgosnell and I chatted more generally about the criteria for selecting EIA plants & utilities to double-check for FERC connections when doing the mapping this morning on the design call, an we decided that for now we don't want to try and pull in more EIA IDs -- since the PUDL IDs are really only for gluing FERC & EIA together now, we just need to figure out what combination of criteria (regulatory status, plant size, ownership type, etc.) will select a reasonable slice of EIA entities to double-check against FERC each year to make sure we didn't miss any connections. But we're going to put that off until next time around.