catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
465 stars 107 forks source link

Foreign key constraint violations between plants_small_ferc1 and plants_ferc1 #1263

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

There are 240 ferc1 small plants which don't appear to be showing up as expected in the plants_ferc1 table. I think these were probably the ones that just got mapped and assigned PUDL IDs for the first time? It could be capitalization, or any other stray string out of place.

{'resource': 'plants_small_ferc1',
'foreign_key': ForeignKey(fields=['utility_id_ferc1', 'plant_name_ferc1'],
reference=ForeignKeyReference(resource='plants_ferc1', fields=['utility_id_ferc1', 'plant_name_ferc1'])),
'invalid':
       utility_id_ferc1                plant_name_ferc1
5893                  3  gold creek enterprise diesel 1
5894                  3  gold creek enterprise diesel 2
5895                  3  gold creek enterprise diesel 3
5896                  3    gold creek fairbanks morse 4
5897                  3    gold creek fairbanks morse 5
...                 ...                             ...
11693               210              highland wind farm
11695               210                 adams wind farm
12046                84             upper pacolet hydro
12218               210             ida grove wind farm
12219               210               o'brien wind farm

[240 rows x 2 columns]}]

For the moment I am going to disable this FK relationship, but we should try and get these names/IDs to line up.

zaneselvans commented 2 years ago

Some examples:

plants_ferc1 = pd.read_sql("plants_ferc1", pudl_engine)
plants_small = pd.read_sql("plants_small_ferc1", pudl_engine)

plants_small[[
    "report_year",
    "utility_id_ferc1",
    "plant_name_ferc1",
    "plant_name_original"
]][
    (plants_small.utility_id_ferc1==133)
    & (plants_small.plant_name_ferc1.str.contains("merced"))
]
report_year utility_id_ferc1 plant_name_ferc1 plant_name_original
578 1994 133 merced falls ferc no. 2467 merced falls ferc no. 2467
1317 1995 133 merced falls ferc no. 2467 merced falls ferc no. 2467
1518 1996 133 merced falls ferc no. 2467 merced falls ferc no. 2467
2439 1997 133 merced falls ferc no. 2467 merced falls ferc no. 2467
2927 1998 133 merced falls ferc no. 2467 merced falls ferc no. 2467
3797 1999 133 merced falls ferc no. 2467 merced falls ferc no. 2467
4243 2000 133 merced falls ferc no. 2467 merced falls ferc no. 2467
4662 2001 133 merced falls ferc no. 2467 merced falls ferc no. 2467
5067 2002 133 merced falls ferc no. 2467 merced falls ferc no. 2467
5514 2003 133 merced falls ferc no. 2467 merced falls ferc no. 2467
6062 2004 133 merced falls merced falls ferc no.2467
6632 2005 133 merced falls merced falls ferc no.2467
6920 2006 133 merced falls merced falls ferc no.2467
7480 2007 133 merced falls merced falls ferc no.2467
8260 2008 133 merced falls merced falls ferc no.2467
8539 2009 133 merced falls merced falls ferc no.2467
9267 2010 133 merced falls merced falls ferc no.2467
9769 2011 133 merced falls merced falls ferc no.2467
9907 2012 133 merced falls merced falls ferc no.2467
10878 2013 133 merced falls merced falls ferc no.2467
11394 2014 133 merced falls merced falls ferc no.2467
11535 2015 133 merced falls merced falls ferc no.2467
12262 2016 133 merced falls merced falls ferc no.2467
12799 2017 133 merced falls ferc no.2467 merced falls ferc no.2467
13834 2018 133 merced falls ferc no.2467 merced falls ferc no.2467
14235 2019 133 merced falls ferc no.2467 merced falls ferc no.2467
plants_ferc1[
    (plants_ferc1.utility_id_ferc1==133)
    & (plants_ferc1.plant_name_ferc1.str.contains("merced"))
]
utility_id_ferc1 plant_name_ferc1 plant_id_pudl
2578 133 merced falls ferc no.2467 1367
5044 133 merced falls ferc no. 2467 1367

So it looks like the plant_name_original has been used for the plant mapping, but a foreign key relationship exists by virtue of the naming that points at plant_name_ferc1 -- which is the canonical name for the column.

zaneselvans commented 2 years ago

So there are two plant name fields in the small plants: plant_name_ferc1 and plant_name_original and it looks like we've used the plant_name_original in the ID mapping. plant_name_ferc1 is currently a manually cleaned plant name based on the spreadsheet down under src/pudl/package_data/ferc/form1. It only covers 2004-2016, but it extracts the FERC permit number and a cleaner fuel type for those years. But it seems like we aren't ever going to update it so we should probably remove it and the changes it makes to the small plants table.

Then we'd rename plant_name_original to plant_name_ferc1 and at that point I think all the mappings would be valid.

Or I guess we could just rename plant_name_original to plant_name_ferc1 now, and call the other column plant_name_clean which would be a one line change.

@aesharpe are you relying on the ferc_license or plant_type columns in the small plants table? Besides the clean plant name they are the only ones coming from the spreadsheet data.