catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Ensure at most a single record from each year in each FERC Plant ID #305

Open zaneselvans opened 5 years ago

zaneselvans commented 5 years ago

In some rare instances, there are FERC Plant time series which end up having more than one record from a given year. This shows up occasionally as FERC Plant IDs with more records associated with them than there are years in the dataset. This is... totally wrong and should never happen.
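A minimal sketch of how these records could be surfaced with pandas. It assumes the FERC plant records sit in a DataFrame with a `plant_id_ferc1` column and a year column, here called `report_year`. The function name `find_duplicate_plant_years`, the `utility` column, and the toy data are all hypothetical and are not taken from the PUDL codebase.

```python
import pandas as pd


def find_duplicate_plant_years(df, id_col="plant_id_ferc1", year_col="report_year"):
    """Return every record whose (plant ID, year) pair occurs more than once.

    Hypothetical helper; the column names are assumptions about the table layout.
    """
    # keep=False marks all members of a duplicated group, not just the repeats.
    is_dupe = df.duplicated(subset=[id_col, year_col], keep=False)
    return df[is_dupe].sort_values([id_col, year_col])


# Toy data: plant 1 has two records for 2012, plant 2 is clean.
steam = pd.DataFrame(
    {
        "plant_id_ferc1": [1, 1, 1, 2, 2],
        "report_year": [2011, 2012, 2012, 2011, 2012],
        "utility": ["A", "A", "B", "C", "C"],
    }
)

dupes = find_duplicate_plant_years(steam)
print(dupes)
```

An empty result would mean that no plant ID has more than one record in any year, which is the invariant this issue asks for.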

Need to:

zaneselvans commented 5 years ago

Thankfully (and also mysteriously) this problem seems quite rare. Thus far I've only identified two cases -- the two which were causing longer-than-dataset-scale plant_id_ferc1 time series.

One of those (the 27-record-long time series) involves FPL's Martin plant (plant_id_pudl=367) and has duplicates in almost every year (all but 2004, the first year). This duplication is triggered when ferc1_years includes 2004 and any two or more later years, and it causes duplication in all the years after 2004.

The other has a duplicate only in 2012 (the "ascutney" plant in Vermont, plant_id_pudl=24). This 2012 duplication also appears when only the years [2011, 2012, 2013] are included, but not with just [2011, 2012] or just [2012, 2013]. The duplicated year is a weird one, in which there's some overlap between the two utilities (Green Mountain Power and Central Vermont PSC) which own the plant before and after that year.

zaneselvans commented 5 years ago

It turns out the Travis tests also trigger this behaviour, with two different plant_id_ferc1 values showing up with duplicate 2017 records. Curiously, a single-year ETL only manages to associate 880 of the 882 records with normal FERC Plant IDs, when they should all be indistinguishable as single-year plants. But it also says there are no orphan records, so those two plants with more than one record in them have pre-emptively absorbed the "orphans" that would have existed, but which should have just been normal 1-year plant records. If 2016 and 2017 are both loaded, the problem doesn't manifest.
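The single-year case can be checked by comparing the record count against the number of distinct plant IDs. The sketch below assumes the single-year output is a DataFrame with a `plant_id_ferc1` column. The helper name `summarize_single_year`, the `report_year` column, and the toy data are hypothetical.

```python
import pandas as pd


def summarize_single_year(df, id_col="plant_id_ferc1"):
    """Count records, distinct plant IDs, and IDs holding more than one record.

    Hypothetical helper; in a single-year load every ID should hold one record.
    """
    counts = df.groupby(id_col).size()
    overfull = counts[counts > 1]
    return {
        "n_records": len(df),
        "n_plant_ids": int(counts.size),
        "overfull_ids": overfull.index.tolist(),
    }


# Toy data: six records in one year, but IDs 11 and 13 each absorbed an extra one.
single_year = pd.DataFrame(
    {
        "plant_id_ferc1": [10, 11, 11, 12, 13, 13],
        "report_year": [2017] * 6,
    }
)

summary = summarize_single_year(single_year)
print(summary)
```

In the toy data the gap between `n_records` and `n_plant_ids` is two, the same kind of gap as the 882 versus 880 described above, and `overfull_ids` names the plant IDs that absorbed the extra records.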