The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
Currently the fuel_cost, hr_by_unit, and hr_by_gen outputs from the MCOE process end up with about the same number of records regardless of whether the frequency of the outputs is annual or monthly, which is... just wrong. Looking at the report_date field in the output dataframes, the "monthly" outputs really are annual. This came up while doing the data validation checks for the v0.4.0 release #681. See this code for example:
My first guess was that this might be related to the (for now) mandatory annual frequency of the net generation allocation process that @cmgosnell has been working on, but that's not involved here anywhere.
She then suggested that it might somehow be related to a very minor tweak I made to the pudl.helpers.merge_on_date_year() function, but reverting those changes results in exactly the same behavior.
Currently I'm at a loss, and am planning to add some defensive AssertionErrors into the MCOE calculation process that check whether the frequency of these dataframes matches the frequency of the pudl_out object that is creating them, which would be a good thing to have hanging out in the background anyway.
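For illustration, a defensive check along these lines might look like the sketch below. The helper name `check_report_date_freq` and the heuristics it uses are assumptions made for this example, not existing PUDL code: it only assumes the dataframe has a `report_date` column and that the expected frequency is given as `"AS"` or `"MS"`.

```python
import pandas as pd


def check_report_date_freq(df: pd.DataFrame, freq: str) -> None:
    """Raise an AssertionError if report_date looks inconsistent with freq.

    Hypothetical helper: "AS" means annual (Jan 1 dates only), "MS" means
    monthly (first-of-month dates spanning more than one calendar month).
    """
    dates = pd.to_datetime(df["report_date"])
    # Both frequencies use period-start dates, so the day is always 1.
    assert (dates.dt.day == 1).all(), "report_date is not at a period start"
    if freq == "AS":
        assert (dates.dt.month == 1).all(), "annual output has non-January dates"
    elif freq == "MS":
        # A monthly table containing only one calendar month is suspicious:
        # it is probably annual data in disguise.
        assert dates.dt.month.nunique() > 1, "monthly output looks annual"
    else:
        raise ValueError(f"Unsupported frequency: {freq}")


# Toy data, made up for the example.
monthly = pd.DataFrame(
    {"report_date": pd.date_range("2019-01-01", periods=12, freq="MS")}
)
annual = pd.DataFrame(
    {"report_date": pd.to_datetime(["2018-01-01", "2019-01-01"])}
)

check_report_date_freq(monthly, "MS")  # passes silently
check_report_date_freq(annual, "AS")  # passes silently
```

Calling `check_report_date_freq(annual, "MS")` raises an AssertionError, which is the situation described above. The "more than one month" test is only a heuristic and would need loosening for tables that legitimately cover a single month.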
Tasks
[x] Create integration tests that check frequency of output tables, and would have caught this bug.
[x] Replace merge_on_date_year() and is_annual() with a simpler merge_asof() based solution.
[x] Re-run monthly MCOE and get a reasonable output.
[x] Update expected eia923 row counts in output validation tests.
[x] Enforce ONLY MS and AS frequency in the pudl_out objects, since that's all we know will work.
[x] Refactor the many-to-many merge that transforms hr_by_unit into hr_by_gen by forcing the bga_gens data to be monthly.
[x] Update expected mcoe row counts in output validation tests.
[x] Write real docstrings for these mcoe output functions.
[x] Re-run data validation tests and get all reasonable outputs.
It turns out the main problem here was a normal merge that really needed to be a merge_asof() style merge. And then there are these other details that come up with the merge_asof() based solution... namely:
that it can't do a many-to-many merge, which is what we need in order to take annual plant-unit-generator information and merge it with monthly plant-unit information to get monthly plant-unit-generator information, and
that searching backwards until you find the right date to merge on is only the right thing to do so long as you stay within the larger block of time that you're trying to merge on (here, typically, the same year as the attribute table).
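A rough sketch of both details, using plain pandas rather than the actual PUDL implementation. The helper names `merge_asof_within_year` and `annual_to_monthly`, the toy data, and the column names are assumptions for illustration. Adding a year column to the `by` keys of `pd.merge_asof()` keeps the backward search inside the same year, and expanding the annual generator associations to monthly records lets an ordinary merge do the one-to-many step.

```python
import pandas as pd

# Toy monthly plant-unit heat rates (values are made up).
hr_by_unit = pd.DataFrame({
    "report_date": pd.to_datetime(["2019-01-01", "2019-02-01", "2020-01-01"]),
    "plant_id_eia": [1, 1, 1],
    "unit_id_pudl": [1, 1, 1],
    "heat_rate_mmbtu_mwh": [10.0, 11.0, 12.0],
})
# Toy annual attributes; note that there is no 2020 record.
annual_attrs = pd.DataFrame({
    "report_date": pd.to_datetime(["2019-01-01"]),
    "plant_id_eia": [1],
    "unit_id_pudl": [1],
    "fuel_type": ["gas"],
})
# Toy annual unit-to-generator associations (two generators per unit).
bga_gens = pd.DataFrame({
    "report_date": pd.to_datetime(["2019-01-01", "2019-01-01"]),
    "plant_id_eia": [1, 1],
    "unit_id_pudl": [1, 1],
    "generator_id": ["A", "B"],
})


def merge_asof_within_year(left, right, by):
    """Backward as-of merge on report_date that never crosses a year boundary."""
    left = left.assign(year=left["report_date"].dt.year)
    right = right.assign(year=right["report_date"].dt.year)
    out = pd.merge_asof(
        left.sort_values("report_date"),
        right.sort_values("report_date"),
        on="report_date",
        by=by + ["year"],  # matching on year blocks stale prior-year matches
        direction="backward",
    )
    return out.drop(columns="year")


def annual_to_monthly(df):
    """Repeat each annual record once per month of its report year."""
    months = pd.DataFrame({"month": range(1, 13)})
    out = df.merge(months, how="cross")
    out["report_date"] = pd.to_datetime(pd.DataFrame({
        "year": out["report_date"].dt.year,
        "month": out["month"],
        "day": 1,
    }))
    return out.drop(columns="month")


ids = ["plant_id_eia", "unit_id_pudl"]
with_attrs = merge_asof_within_year(hr_by_unit, annual_attrs, by=ids)
hr_by_gen = hr_by_unit.merge(
    annual_to_monthly(bga_gens), on=["report_date"] + ids, how="inner"
)
```

In `with_attrs` the two 2019 months pick up the 2019 attribute, while the 2020 month gets a null rather than silently inheriting the 2019 value. `hr_by_gen` ends up with one record per month per generator for the months that have a matching association.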