Assign utility_id_pudl values to utilities with data in EIA 923

zaneselvans commented 4 years ago

For some reason, there are still 865 utility_id_eia values for which there are records in the EIA 923 data, which do not have a utility_id_pudl assigned to them. For completeness, and just in case they map to existing FERC utilities somehow, it seems like we should get them added to the mapping spreadsheet.

Sub-tasks:

[x] Update the ferc1_eia glue notebook to generate this list of utilities (have data, but no PUDL utility ID) when we are looking for new utilities each year.
[ ] Get the utilities mapped in the spreadsheet, or at least assigned IDs.
[ ] Re-run the data validation and also the EIA-FERC comparison notebook to see if anything weird happened as a result of getting these utilities mapped.

The records which have utility_id_eia values that lack utility_id_pudl have the following distribution... like we mapped all of the utilities initially when we did 2011-2015, but since then, we've been missing a chunk of utilities every year, and then there are some from 2009 too. Strangely nothing in 2010 though? Really?

frc_eia923 = pudl_raw.frc_eia923()
gf_eia923 = pudl_raw.gf_eia923()
gen_eia923 = pudl_raw.gen_eia923()
bf_eia923 = pudl_raw.bf_eia923()

missing_utils = set([])
missing_utils = missing_utils.union(set(frc_eia923[frc_eia923.utility_id_pudl.isna()].utility_id_eia.unique()))
missing_utils = missing_utils.union(set(gf_eia923[gf_eia923.utility_id_pudl.isna()].utility_id_eia.unique()))
missing_utils = missing_utils.union(set(bf_eia923[bf_eia923.utility_id_pudl.isna()].utility_id_eia.unique()))
missing_utils = missing_utils.union(set(gen_eia923[gen_eia923.utility_id_pudl.isna()].utility_id_eia.unique()))

bad_dates_frc = frc_eia923[frc_eia923.utility_id_eia.isin(missing_utils)][["report_date", "utility_id_eia"]]
bad_dates_gf = gf_eia923[gf_eia923.utility_id_eia.isin(missing_utils)][["report_date", "utility_id_eia"]]
bad_dates_gen = gen_eia923[gen_eia923.utility_id_eia.isin(missing_utils)][["report_date", "utility_id_eia"]]
bad_dates_bf = bf_eia923[bf_eia923.utility_id_eia.isin(missing_utils)][["report_date", "utility_id_eia"]]

bad_dates = pd.concat([bad_dates_frc, bad_dates_gf, bad_dates_gen, bad_dates_bf])
plt.hist(bad_dates.report_date, bins=121);

However, looking at records in the EIA 860 plants/generators/ownership tables, where the utility_id_pudl field is null (but by definition there's is a plant or generator present), we get this distribution of dates:

mup1 = plants_eia860[plants_eia860.utility_id_pudl.isna()][["utility_id_eia", "plant_id_eia", "report_date"]]
mup2 = gens_eia860[gens_eia860.utility_id_pudl.isna()][["utility_id_eia", "plant_id_eia", "report_date"]]
mup3 = own_eia860[own_eia860.utility_id_pudl.isna()][["utility_id_eia", "plant_id_eia", "report_date"]]
mup4 = own_eia860[own_eia860.utility_id_pudl.isna()][["owner_utility_id_eia", "plant_id_eia", "report_date"]].rename(columns={"owner_utility_id_eia": "utility_id_eia"})
missing_utils_plants = pd.concat([mup1, mup2, mup3, mup4])

zaneselvans commented 4 years ago

Hey @swinter2011 this is ready for you if you want to map / assign IDs to the plants that have data.

zaneselvans commented 4 years ago

@cmgosnell and I chatted more generally about the criteria for selecting EIA plants & utilities to double-check for FERC connections when doing the mapping this morning on the design call, an we decided that for now we don't want to try and pull in more EIA IDs -- since the PUDL IDs are really only for gluing FERC & EIA together now, we just need to figure out what combination of criteria (regulatory status, plant size, ownership type, etc.) will select a reasonable slice of EIA entities to double-check against FERC each year to make sure we didn't miss any connections. But we're going to put that off until next time around.

catalyst-cooperative / pudl

Assign utility_id_pudl values to utilities with data in EIA 923 #544

Sub-tasks: