catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Allow harvesting of association tables #640

Open zaneselvans opened 4 years ago

zaneselvans commented 4 years ago

In addition to being able to harvest a single value that's associated with a given entity permanently, or on a per-year basis, we also need to be able to harvest association tables -- all the observed combinations of several columns (e.g. report_date, balancing_authority_id_eia, utility_id_eia, and state).
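As a rough illustration of what such an association table looks like, here is a plain pandas sketch. It is not PUDL's harvesting code: the `records` dataframe, its values, and the `net_generation_mwh` column are invented for the example.

```python
import pandas as pd

# Hypothetical input records; all values are invented for illustration.
records = pd.DataFrame(
    {
        "report_date": ["2019-01-01", "2019-01-01", "2019-01-01", "2020-01-01"],
        "balancing_authority_id_eia": [1, 1, 2, 1],
        "utility_id_eia": [10, 10, 20, 10],
        "state": ["CO", "CO", "WY", "CO"],
        "net_generation_mwh": [5.0, 7.5, 3.2, 6.1],
    }
)

key_columns = [
    "report_date",
    "balancing_authority_id_eia",
    "utility_id_eia",
    "state",
]

# The association table is every distinct observed combination of the
# key columns; non-key columns are dropped before deduplicating.
association = records[key_columns].drop_duplicates().reset_index(drop=True)
print(association)
```

Unlike harvesting a single value per entity, nothing is aggregated or reconciled here: every combination that appears in the data is kept.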

zaneselvans commented 2 years ago

Hey @ezwelty, did this end up getting integrated into the big PR #806?

ezwelty commented 2 years ago

Yes, if you mean all combinations of several columns, where all those columns are present. See "Harvest process" section of my top post in #806.

Resource.harvest_dfs() harvests from multiple named input dataframes. Only dataframes with all the primary key fields are included. If Resource.harvest.harvest=True, all such dataframes are harvested and, by default, also aggregated.

Note the sentence "Only dataframes with all the primary key fields are included." So in your example, if you wanted to include combinations of report_date and balancing_authority_id_eia from a table that lacks utility_id_eia and state, that would not work. Adding that functionality would be simple enough: Resource.format_df() could insert empty columns for the missing primary key columns in the case of a partial match.
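To make the difference concrete, here is a minimal pandas sketch of strict versus partial matching. It is not the actual Resource.format_df() or Resource.harvest_dfs() code: `format_partial`, `harvest_association`, the `allow_partial` flag, and the sample dataframes are all hypothetical.

```python
import pandas as pd


def format_partial(df, key_columns):
    """Return df restricted to key_columns, adding null columns for missing keys.

    Hypothetical helper. Returns None if df has none of the key columns.
    """
    if not any(col in df.columns for col in key_columns):
        return None
    # reindex drops non-key columns and adds missing key columns as NaN.
    return df.reindex(columns=key_columns)


def harvest_association(dfs, key_columns, allow_partial=False):
    """Collect distinct key combinations from a dict of named dataframes."""
    frames = []
    for df in dfs.values():
        if set(key_columns).issubset(df.columns):
            # Full match: all key columns are present.
            frames.append(df[key_columns])
        elif allow_partial:
            formatted = format_partial(df, key_columns)
            if formatted is not None:
                frames.append(formatted)
    if not frames:
        return pd.DataFrame(columns=key_columns)
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)


keys = ["report_date", "balancing_authority_id_eia", "utility_id_eia", "state"]
dfs = {
    # Has every key column, so it is always harvested.
    "full": pd.DataFrame(
        {
            "report_date": ["2019-01-01", "2019-01-01"],
            "balancing_authority_id_eia": [1, 1],
            "utility_id_eia": [10, 10],
            "state": ["CO", "CO"],
        }
    ),
    # Lacks utility_id_eia and state, so it is skipped unless allow_partial.
    "partial": pd.DataFrame(
        {
            "report_date": ["2019-01-01", "2020-01-01"],
            "balancing_authority_id_eia": [1, 2],
        }
    ),
}

strict = harvest_association(dfs, keys)
relaxed = harvest_association(dfs, keys, allow_partial=True)
```

In this sketch, rows from the partial table carry nulls in utility_id_eia and state, so anything downstream that treats these columns as a primary key would need to tolerate or fill those nulls.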