Closed zaneselvans closed 2 years ago
I've got an idea that came out of the CCAI work. We wanted a list of sub-components within each of the plant-part records. Every plant-part record is an aggregation of generator records (or a single generator record in the case of `plant_part = "plant_gen"`).
We now have this one-to-many relationship: one plant-part record to many component generators.
I think this can be employed to drastically reduce the complexity and increase the speed of labeling the true/distinct plant-part records.
The fuzzy FERC to EIA merge depends on a complicated aggregation of lots of different possible combinations of generators and their ownership slices for connection to the messy mix of stuff that's reported in FERC.
One step in this process is particularly slow and difficult to understand right now: the different possible parts of plants are labelled as either distinct or duplicative, with a prioritization as to which kind of part should be kept if duplicates are found. Currently this is implemented by `pudl.analysis.plant_parts_eia.LabelTrueGranularities`.
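As a rough illustration of how the one-to-many generator lists might simplify this labeling step, here is a minimal sketch in plain Python. It is not the PUDL implementation: the record tuples, the `label_true_granularities` function, and the `PART_PRIORITY` ordering are all hypothetical, and ownership slices are ignored. The idea is simply that two plant-part records are duplicates when they are built from exactly the same set of generators.

```python
from collections import defaultdict

# Hypothetical priority order, most preferred first. The real ordering used
# in PUDL may differ; this list exists only for illustration.
PART_PRIORITY = ["plant", "plant_unit", "plant_gen"]


def label_true_granularities(records):
    """Label each plant-part record as distinct or duplicate.

    records: iterable of (record_id, plant_part, generator_ids) tuples.
    Returns a dict mapping record_id -> (is_true_gran, true_record_id).
    """
    # Records built from the same set of generators land in the same group.
    groups = defaultdict(list)
    for record_id, plant_part, generator_ids in records:
        groups[frozenset(generator_ids)].append((record_id, plant_part))

    labels = {}
    for members in groups.values():
        # Keep the member whose plant part ranks highest in the priority list.
        keeper_id, _ = min(members, key=lambda m: PART_PRIORITY.index(m[1]))
        for record_id, _ in members:
            labels[record_id] = (record_id == keeper_id, keeper_id)
    return labels


# Made-up example: unit B contains only generator g3, so the unit record and
# the generator record describe the same thing.
records = [
    ("p1_plant", "plant", ["g1", "g2", "g3"]),
    ("p1_unit_a", "plant_unit", ["g1", "g2"]),
    ("p1_unit_b", "plant_unit", ["g3"]),
    ("p1_gen_1", "plant_gen", ["g1"]),
    ("p1_gen_2", "plant_gen", ["g2"]),
    ("p1_gen_3", "plant_gen", ["g3"]),
]
labels = label_true_granularities(records)
```

Here `labels["p1_gen_3"]` points at `p1_unit_b` as the record to keep, and every other record is labelled as its own true granularity. In a dataframe setting the same grouping could presumably be done with a groupby on a hashable generator-set key.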
The fact that there's a hierarchy of different plant parts, with `plant` at the top (the largest possible group, including all generators within the plant) and `generator` at the bottom (the smallest possible unit), and that decisions are being made based on the relationships between a particular plant part and its "parent" and "child" parts in the hierarchy, makes me think that this might be some kind of common tree pruning problem that already has a solution, and maybe an implementation within NetworkX.
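To make the tree framing concrete, here is a small sketch using NetworkX. The node names, the toy hierarchy, and the `duplicate_parts` helper are hypothetical, and the rule that the higher-level part is the one to keep is an assumption for illustration only. In a tree whose leaves are generators, a part with exactly one child covers the same generators as that child, so finding duplicates amounts to collapsing single-child chains.

```python
import networkx as nx

# Toy hierarchy: edges point from a parent part to its child parts.
tree = nx.DiGraph()
tree.add_edges_from([
    ("plant", "unit_a"), ("plant", "unit_b"),
    ("unit_a", "gen_1"), ("unit_a", "gen_2"),
    ("unit_b", "gen_3"),
])


def duplicate_parts(tree):
    """Map each duplicate part to the ancestor part that would be kept.

    Assumes the higher-level part wins when a parent and its only child
    cover the same generators (an illustrative choice, not PUDL's rule).
    """
    dupes = {}
    # A topological order visits parents before children, so chains of
    # single-child parts resolve to the topmost part in the chain.
    for node in nx.topological_sort(tree):
        children = list(tree.successors(node))
        if len(children) == 1:
            dupes[children[0]] = dupes.get(node, node)
    return dupes


dupes = duplicate_parts(tree)
```

For the toy tree above only `gen_3` is flagged, with `unit_b` as the part to keep. Flipping the priority (keeping the lower-level part instead) would only change which end of each chain is recorded as the keeper.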