catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Refactor LabelTrueGranularities for clarity and speed #1265

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

The fuzzy FERC to EIA merge depends on a complicated aggregation of lots of different possible combinations of generators and their ownership slices for connection to the messy mix of stuff that's reported in FERC.

One step in this process is currently slow and particularly difficult to understand: labeling the different possible parts of plants as either distinct or duplicative, with a prioritization as to which kind of part should be kept if duplicates are found. This is currently implemented by pudl.analysis.plant_parts_eia.LabelTrueGranularities.

There's a hierarchy of different plant parts, with plant at the top (the largest possible group, including all generators within the plant) and generator at the bottom (the smallest possible unit), and decisions are being made based on the relationships between a particular plant part and its "parent" and "child" parts in that hierarchy. That makes me think this might be some kind of common tree pruning problem that already has a solution, and maybe an implementation within NetworkX.
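To make the tree framing concrete, here is a minimal sketch, not the existing LabelTrueGranularities logic. It assumes a toy hierarchy for a single hypothetical plant (the node names `unit_1`, `gen_a`, etc. are made up) and assumes that when a part is the only child of its parent, the parent is the record to keep. The helper `label_duplicates` is hypothetical; only the NetworkX calls are real.

```python
import networkx as nx

# Toy hierarchy for one hypothetical plant; edges point from parent to child.
tree = nx.DiGraph()
tree.add_edges_from([
    ("plant", "unit_1"),
    ("plant", "unit_2"),
    ("unit_1", "gen_a"),
    ("unit_1", "gen_b"),
    ("unit_2", "gen_c"),
])


def label_duplicates(tree):
    """Map each part to the topmost part that covers the same generators.

    Assumption: an only child covers exactly the same generators as its
    parent, so it is duplicative and inherits the parent's label.
    """
    labels = {}
    # A topological sort visits every parent before its children.
    for node in nx.topological_sort(tree):
        parents = list(tree.predecessors(node))
        if parents and tree.out_degree(parents[0]) == 1:
            labels[node] = labels[parents[0]]
        else:
            labels[node] = node
    return labels


labels = label_duplicates(tree)
```

In this toy case `gen_c` is the only generator in `unit_2`, so it is labeled with `unit_2`, while every other part is labeled with itself. The real plant parts are not a strict tree (a generator belongs to several overlapping groupings), so this only illustrates the parent/child idea.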

cmgosnell commented 2 years ago

I've got an idea that came out of the CCAI work we've been doing. We wanted a list of the sub-components within each of the plant-part records. Every plant-part record is an aggregation of generator records (or a single generator record in the context of plant_part = "plant_gen").

We now have this one-to-many relationship: one plant-part record to many constituent generators.

I think this can be employed to drastically reduce the complexity and increase the speed of labeling the true/distinct plant-part records.
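A rough sketch of how that one-to-many table could drive the labeling, under stated assumptions: the column names (`record_id`, `plant_part`, `generator_id`), the shortened priority list `PART_ORDER`, and the helper `label_true_records` are all hypothetical and are not the PUDL schema or API. The idea is that two plant-part records are duplicates exactly when they are made of the same set of generators, so we can build a key from each record's generators and keep the highest-priority record per key.

```python
import pandas as pd

# Hypothetical priority order, highest priority first.
PART_ORDER = ["plant", "plant_unit", "plant_gen"]

# One row per (plant-part record, constituent generator).
part_to_gens = pd.DataFrame({
    "record_id": ["p1", "p1", "p1", "u1", "u1", "u2", "g_a", "g_b", "g_c"],
    "plant_part": ["plant"] * 3 + ["plant_unit"] * 3 + ["plant_gen"] * 3,
    "generator_id": ["a", "b", "c", "a", "b", "c", "a", "b", "c"],
})


def label_true_records(df):
    """Flag the record to keep among records built from identical generators."""
    # Collapse each record's generators into one order-independent key.
    records = (
        df.groupby(["record_id", "plant_part"])["generator_id"]
        .agg(lambda gens: ",".join(sorted(gens)))
        .rename("gen_key")
        .reset_index()
    )
    records["rank"] = records["plant_part"].map(PART_ORDER.index)
    # For each generator key, keep the record with the best (lowest) rank.
    best = (
        records.sort_values("rank")
        .drop_duplicates("gen_key")[["gen_key", "record_id"]]
        .rename(columns={"record_id": "true_record_id"})
    )
    out = records.merge(best, on="gen_key")
    out["is_true"] = out["record_id"] == out["true_record_id"]
    return out.set_index("record_id")[["plant_part", "true_record_id", "is_true"]]


labeled = label_true_records(part_to_gens)
```

Here `u2` and `g_c` are both made of generator `c` alone, so `g_c` is marked as a duplicate of `u2`. Because this is a single groupby plus a merge, it avoids comparing each plant part against its parents and children one level at a time.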

what we need