catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

FERC plant records showing up in multiple time series #240

Closed zaneselvans closed 5 years ago

zaneselvans commented 5 years ago

For some reason, a small fraction of the FERC plant records are showing up as associated with more than one FERC plant ID. This should never happen, so it means something is broken deep down inside the where_matches() and best_matches() functions.

This is a sub-issue of #144

zaneselvans commented 5 years ago

The problem here is in the way the plant_id_ferc1 values are assigned. Right now we're considering a time series valid if every record within it only shows up in time series which are mutually consistent: some of them may have gaps (depending on which year is used as the seed for matching), but none of them contain records that conflict with each other. This flexibility dramatically increases the number of records which are successfully assigned to a time series, but it makes assigning the ID a little more complicated, since we need to ensure that any record that shows up in any one of these potentially partial time series gets assigned the same plant ID. This means that it's the union of all records that are found to be associated with each other that makes up the time series that gets a plant ID.

If we don't do this, then each of these different but mutually compatible collections of records ends up being considered its own independent time series, and that means the records in them get used more than once, which is a no-no.
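To make the double-counting concrete, here is a minimal sketch, not the actual PUDL code. The record IDs, the `candidates` list, and both helper functions are hypothetical. It shows how two partial time series built from different seed years share records, and how taking their union removes the duplication.

```python
from collections import Counter

# Hypothetical candidate time series: each set holds the record IDs that
# were matched together starting from a different seed year.
candidates = [
    {"r2004", "r2005", "r2006"},
    {"r2005", "r2006", "r2007"},  # overlaps the first, different seed year
    {"x2004", "x2005"},
]


def duplicated_records(series_list):
    """Return record IDs that appear in more than one candidate series."""
    counts = Counter(rec for series in series_list for rec in series)
    return {rec for rec, n in counts.items() if n > 1}


def merge_overlapping(series_list):
    """Union any candidate series that share at least one record."""
    merged = []
    for series in series_list:
        series = set(series)
        disjoint = []
        for group in merged:
            if group & series:
                series |= group  # absorb the overlapping group
            else:
                disjoint.append(group)
        merged = disjoint + [series]
    return merged


merged = merge_overlapping(candidates)
```

Treated independently, the first two candidates would each get their own plant ID and `r2005` and `r2006` would be used twice. After merging, `duplicated_records(merged)` is empty, so each record can belong to only one time series.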

zaneselvans commented 5 years ago

I've got this more complicated assignment of plant_ids working now, in the same way that we did the generation unit assignment using connections between boilers and generators -- creating a graph and finding the connected components, each of which gets assigned a FERC plant ID. But there's still something fishy going on. Every single record_id should either end up assigned to a FERC plant_id in the main graph, or it should end up orphaned, and get assigned a FERC plant_id after the fact. But for some reason, there are still 91 records that end up without a plant ID. Still debugging.
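For reference, the connected-components idea can be sketched roughly as below. This is an illustration using only the standard library, not the code in the repository. The function name `assign_plant_ids`, the record IDs, and the shape of `matched_pairs` are all assumptions. Unlike the process described above, it handles orphans in the same pass by treating each one as a component of size one, and it ends with the sanity check that every record received an ID.

```python
from collections import defaultdict, deque


def assign_plant_ids(all_record_ids, matched_pairs):
    """Map every record ID to an integer plant ID (hypothetical helper)."""
    # Build an undirected graph: records are nodes, matches are edges.
    neighbors = defaultdict(set)
    for a, b in matched_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    plant_ids = {}
    next_id = 1
    # Sorting only makes the ID numbering deterministic.
    for start in sorted(all_record_ids):
        if start in plant_ids:
            continue
        # Breadth-first search collects one connected component. A record
        # with no edges (an orphan) forms a component of size one.
        queue = deque([start])
        plant_ids[start] = next_id
        while queue:
            rec = queue.popleft()
            for other in neighbors[rec]:
                if other not in plant_ids:
                    plant_ids[other] = next_id
                    queue.append(other)
        next_id += 1
    return plant_ids


# Made-up example: two multi-year plants and one orphaned record.
records = ["a2004", "a2005", "a2006", "b2004", "b2005", "c2004"]
pairs = [("a2004", "a2005"), ("a2005", "a2006"), ("b2004", "b2005")]
ids = assign_plant_ids(records, pairs)

# Sanity check: any record left without a plant ID would show up here.
unassigned = set(records) - set(ids)
```

Comparing the full set of record IDs against the keys of the result, as in the last line, is one simple way to list exactly which records fell through.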