Up until merging in the scenario data, expanding the data with the `scenario_geography` and `equity_market` columns drastically multiplies the number of rows, and the grouped calculations necessitated by these otherwise duplicated rows are a source of the incredibly long run times. Essentially, for every combination of `id`, `technology`, and `year`, we multiply the rows by every combination of `scenario_geography` and `equity_market` and calculate duplicate data for all of them.
We should carefully consider whether this is actually necessary, and if not, calculate as much as we can before expanding to the `scenario_geography` and `equity_market` values. @jdhoffa @jacobvjk @AlexAxthelm
https://github.com/RMI-PACTA/pacta.data.preparation/blob/ba0f8b8518afb2d00bfe5d9bff1a935418eaa5dd/R/dataprep_abcd_scen_connection.R#L143-L151
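To make the problem concrete, here is a minimal sketch of the two orderings. Note this is Python/pandas rather than the package's R/dplyr, and all column values and the `share` calculation are made up for illustration; only the column names come from the issue above:

```python
import pandas as pd

# Toy data: one row per (id, technology, year) combination.
abcd = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "technology": ["coal", "solar", "coal", "solar"],
    "year": [2030] * 4,
    "value": [10.0, 20.0, 30.0, 40.0],
})

# Expansion grid: every scenario_geography x equity_market combination.
grid = pd.merge(
    pd.DataFrame({"scenario_geography": ["Global", "EU", "US"]}),
    pd.DataFrame({"equity_market": ["GlobalMarket", "DevelopedMarket"]}),
    how="cross",
)

# Current ordering: expand first, so any grouped calculation afterwards runs
# on 4 * 3 * 2 = 24 rows and recomputes the same numbers 6 times each.
expanded_first = pd.merge(abcd, grid, how="cross")

# Proposed ordering: do the per-(id, year) grouped work once on the 4
# original rows (a hypothetical "share" column stands in for the real
# calculations), then expand.
abcd["share"] = abcd["value"] / abcd.groupby(["id", "year"])["value"].transform("sum")
expanded_after = pd.merge(abcd, grid, how="cross")

# Both orderings end with the same row count, but the grouped step in the
# second one touched 6x fewer rows.
print(len(expanded_first), len(expanded_after))
```

The row counts are identical either way; the win is that every grouped computation that does not depend on `scenario_geography` or `equity_market` runs on the pre-expansion data.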
related RMI-PACTA/pacta.data.preparation#7