catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Re-write clean_merge_asof() to do a generic merge on datetime components #1128

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

Our slightly off-label use of pd.merge_asof() to merge dataframes of different temporal resolution is slow and has some weird edge cases related to the tolerance parameter (how far afield should it look to find a "matching" datetime). A simpler, probably faster, option would be to simply explode a specified left and right datetime columns into a specified subset their component parts (year, month, day, etc.) and merge on those as integer keys. This would allow all the different kinds of normal merges to work, and leave it up to the user to choose to do reasonable things...

Right now clean_merge_asof() works for the year-to-year and year-to-month merges that broadcast in one direction (many-to-one where the many side is the left dataframe), but there are a couple of cases in which we need to do a many to many merge and that has to be done as a one-off right now.

This functionality is currently used in the mcoe and eia923 output functions.

zaneselvans commented 2 years ago

This will likely be taken care of by virtue of the new harvesting process, which will differentiate between different time frequencies in the column name ('report_year, report_month etc.) and the addition of multiple time frequencies to enable merges / FK relationships between tables of different frequencies. See #1247