catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Design one-many FERC-EIA connection process #1114

Closed aesharpe closed 2 years ago

aesharpe commented 3 years ago

DUPLICATE

We figured out how to do it with the overrides. Now we just need to come up with a way to turn those groups of EIA records into a single MUL record.

cmgosnell commented 2 years ago

I'd propose compiling all of our known one-many connections in a standardized place and generating new EIA plant-part records from the "many" records. These records could be generated at the end of the plant-part list process, and we can subsequently use them to replace our one-many mappings.

This way we could treat these "many" maps just like the rest of the one-one connections. We know these are relatively rare, so there will not be a giant pile of "many"s. We can compile them manually and generate them automatically.
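To make the "aggregate early" idea concrete, here is a rough sketch in plain Python. Everything in it is hypothetical: the record IDs, the `ONE_MANY_OVERRIDES` table, the `build_many_records` helper, and the choice to sum every column are illustrative assumptions, not existing PUDL code. Real plant-part attributes would probably need column-specific aggregation (sums for capacity, weighted averages for rates, and so on).

```python
from collections import defaultdict

# Hypothetical, manually compiled one-many connections:
# each FERC record ID maps to the EIA plant-part record IDs it covers.
ONE_MANY_OVERRIDES = {
    "ferc_A": ["eia_1_gen_1", "eia_1_gen_2"],
}

# Hypothetical, heavily simplified plant-part list: record ID -> attributes.
plant_parts = {
    "eia_1_gen_1": {"capacity_mw": 50.0, "net_generation_mwh": 1000.0},
    "eia_1_gen_2": {"capacity_mw": 30.0, "net_generation_mwh": 500.0},
    "eia_2_plant": {"capacity_mw": 100.0, "net_generation_mwh": 4000.0},
}


def build_many_records(overrides, plant_parts):
    """Create one aggregated record per group of "many" EIA records.

    Returns the new records and a one-one map from FERC ID to new record ID.
    Assumption: every attribute can simply be summed.
    """
    new_records = {}
    one_one_map = {}
    for ferc_id, eia_ids in overrides.items():
        # Sorting makes the new ID independent of the order the IDs were listed in.
        new_id = "many_" + "_".join(sorted(eia_ids))
        if new_id not in new_records:
            totals = defaultdict(float)
            for eia_id in eia_ids:
                for column, value in plant_parts[eia_id].items():
                    totals[column] += value
            new_records[new_id] = dict(totals)
        one_one_map[ferc_id] = new_id
    return new_records, one_one_map


new_records, one_one_map = build_many_records(ONE_MANY_OVERRIDES, plant_parts)

# Append the generated records to the end of the plant-part list, so the
# former one-many connection becomes an ordinary one-one connection.
plant_parts.update(new_records)
```

After this step `one_one_map` links `ferc_A` to a single generated record, so the matching process never has to see a one-many connection.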

The other options I see generally involve removing these one-many matches from the FERC-EIA matching process altogether (because the model does not know how to do one-many record linkages). This would involve keeping the one-many matches that we have manually identified off to the side for the FERC-EIA process... but it would also require us to do some aggregation of the "many"s after any mappings in order to treat the FERC-EIA (or Deprish-EIA) data as if it had a standard structure.
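For contrast, a rough sketch of what "aggregate later" could look like. The link lists, the `aggregate_after_mapping` helper, and the capacity values are all made up for illustration. The point is only that every downstream consumer of the links would need a step like this before the data has one row per FERC record.

```python
def aggregate_after_mapping(links, eia_values):
    """Sum an EIA value per FERC record, given (ferc_id, eia_id) link pairs."""
    totals = {}
    for ferc_id, eia_id in links:
        totals[ferc_id] = totals.get(ferc_id, 0.0) + eia_values[eia_id]
    return totals


# Hypothetical one-one links produced by the matching model.
model_links = [("ferc_B", "eia_2_plant")]

# Hypothetical one-many links kept off to the side of the matching process.
manual_one_many_links = [
    ("ferc_A", "eia_1_gen_1"),
    ("ferc_A", "eia_1_gen_2"),
]

# Hypothetical EIA capacities keyed by plant-part record ID.
eia_capacity_mw = {
    "eia_1_gen_1": 50.0,
    "eia_1_gen_2": 30.0,
    "eia_2_plant": 100.0,
}

capacity_by_ferc = aggregate_after_mapping(
    model_links + manual_one_many_links, eia_capacity_mw
)
```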

The way I see it, we either have to aggregate early or aggregate later, and I'd rather aggregate early and treat these connections with one standard methodology.

Does that sound reasonable to you, @aesharpe?