Open zaneselvans opened 5 years ago
Thankfully (and also mysteriously) this problem seems quite rare. Thus far I've only identified two cases -- the two which were causing `plant_id_ferc1` time series longer than the number of years in the dataset.
One of those (the 27-record-long time series) involves FPL's Martin plant (`plant_id_pudl=367`), and it duplicated almost every year (all but 2004, the first year). This duplication is triggered when `ferc1_years` includes 2004 and any two or more later years, and it causes duplication in all the years after 2004.
The other duplicated only 2012 (the "ascutney" plant in Vermont, `plant_id_pudl=24`). This 2012 duplication also appears to happen when only years `[2011, 2012, 2013]` are included, but not when it's just `[2011, 2012]` or when it's just `[2012, 2013]`. The duplicated year is a weird one, in which there's some overlap between two utilities (Green Mountain Power and Central Vermont PSC) which own the plant before and after that year.
It turns out the Travis tests also trigger this behaviour, with two different `plant_id_ferc1` values showing up with duplicate 2017 records. Curiously, the single-year ETL also only manages to associate 880 of the 882 records with normal FERC Plant IDs -- when they should all be indistinguishable as single-year plants. But it also says there are no orphan records -- so those two plants with more than 1 record in them have pre-emptively absorbed the "orphans" that would have existed, but should have just been normal 1-year plant records. If 2016 and 2017 are loaded, the problem doesn't manifest.
In some rare instances, there are FERC Plant time series which end up having more than one record from a given year. This shows up occasionally as FERC Plant IDs with more records associated with them than there are years in the dataset. This is... totally wrong and should never happen.
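For anyone wanting to check for this symptom, here's a minimal pandas sketch of the two checks described above: (plant ID, year) pairs with more than one record, and plant IDs with more records than there are years in the data. The function names, the `report_year` column name, and the toy dataframe are assumptions for illustration only, not the actual PUDL code or data.

```python
import pandas as pd


def find_duplicate_plant_years(df, id_col="plant_id_ferc1", year_col="report_year"):
    """Return (plant ID, year) pairs that have more than one record.

    The column names are assumptions; adjust them to match the real table.
    """
    counts = (
        df.groupby([id_col, year_col])
        .size()
        .rename("n_records")
        .reset_index()
    )
    return counts[counts["n_records"] > 1].reset_index(drop=True)


def find_overlong_series(df, id_col="plant_id_ferc1", year_col="report_year"):
    """Return plant IDs with more records than there are distinct years."""
    n_years = df[year_col].nunique()
    series_len = df.groupby(id_col).size()
    return series_len[series_len > n_years]


# Toy data (made up): plant 2 has two records for 2012.
plants = pd.DataFrame(
    {
        "plant_id_ferc1": [1, 1, 1, 2, 2, 2, 2],
        "report_year": [2011, 2012, 2013, 2011, 2012, 2012, 2013],
    }
)

dupes = find_duplicate_plant_years(plants)
overlong = find_overlong_series(plants)
print(dupes)
print(overlong)
```

The first check is the stricter one: a plant that is missing some years could have a duplicated year without its series being longer than the dataset, so it would slip past the second check.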
Need to: