BlockScience / PocketSimulationModel


(For reference) Standardized simulation result DataFrame for sensitivity analysis #308

Closed jshorish closed 10 months ago

jshorish commented 10 months ago

The sensitivity analysis / feature importance library cadCAD_machine_search requires that the simulation results DataFrame, hereafter called df, be in a particular post-processed state before threshold KPIs can be measured. This issue is just a placeholder for the operations to achieve this state (and can be ignored if these are already in place):

Removal of substeps (often standard, repeated here to be exhaustive)

These lines remove the intermediate partial state update block (PSUB) substeps from df, keeping only the initial state row and the last substep of each timestep.

```python
# Keep only the initial state row and the final substep of each timestep.
first_ind = (df.substep == 0) & (df.timestep == 0)
last_ind = df.substep == max(df.substep)
inds_to_keep = first_ind | last_ind
df = df.loc[inds_to_keep].drop(columns=['substep'])
```
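As a sanity check, the filter above can be demonstrated on a small mock results frame (the column values here are illustrative, not from the actual model):

```python
import pandas as pd

# Mock cadCAD results: 2 timesteps with 2 substeps each, plus the
# initial state row (substep 0, timestep 0).
df = pd.DataFrame({
    'timestep': [0, 1, 1, 2, 2],
    'substep':  [0, 1, 2, 1, 2],
    'x':        [10, 11, 12, 13, 14],
})

# Keep only the initial state row and the final substep of each timestep.
first_ind = (df.substep == 0) & (df.timestep == 0)
last_ind = df.substep == max(df.substep)
df = df.loc[first_ind | last_ind].drop(columns=['substep'])
```

The filtered frame retains one row per timestep (the post-PSUB state), with the `substep` column dropped.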

Addition of control parameter constellations as columns

Each simulation output row should have one column per control parameter, containing the value of the parameter from the parameter 'constellation' vector for that row. This may be already handled by the simulation workflow. If not, one way to add them in post-processing is:

  1. Retrieve the populated configs object from the cadCAD simulation for this df (this is usually initialized via from cadCAD import configs). This contains, for each run i = 0, 1, ... in configs[i], all of the parameter constellation information in the sim_config dictionary attribute;
  2. Assign the parameters to each row of the df:

```python
df = df.assign(**configs[0].sim_config['M'])
for i, (_, n_df) in enumerate(df.groupby(['simulation', 'subset', 'run'])):
    df.loc[n_df.index] = n_df.assign(**configs[i].sim_config['M'])
```

    The result is that df gains one new column per control parameter, labeled by the parameter name and containing the parameter value used in the run that produced each row.
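Step 2 can be exercised end-to-end with stand-in objects. Here `MockConfig` only mimics the `sim_config['M']` attribute of a real cadCAD config (the real objects come from `from cadCAD import configs`), and the parameter name `alpha` is purely illustrative:

```python
import pandas as pd

class MockConfig:
    """Stand-in for a cadCAD config exposing sim_config['M']."""
    def __init__(self, params):
        self.sim_config = {'M': params}

# One mock config per (simulation, subset, run) group, in group order.
configs = [MockConfig({'alpha': 0.1}), MockConfig({'alpha': 0.9})]

df = pd.DataFrame({
    'simulation': [0, 0],
    'subset':     [0, 1],
    'run':        [1, 1],
    'x':          [1.0, 2.0],
})

# Seed the parameter columns, then overwrite them per group.
df = df.assign(**configs[0].sim_config['M'])
for i, (_, n_df) in enumerate(df.groupby(['simulation', 'subset', 'run'])):
    df.loc[n_df.index] = n_df.assign(**configs[i].sim_config['M'])
```

Each row now carries the `alpha` value from the config of its own group, not just the first config's value.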

SeanMcOwen commented 10 months ago

I believe this is all handled in post-processing already. I map every parameter into the results DataFrame, prefixing each column name with `param_`.
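The prefixing convention described above can be sketched as follows (the `params` dict and column names are hypothetical; the actual post-processing code may differ):

```python
import pandas as pd

# Hypothetical parameter constellation for one run.
params = {'alpha': 0.1, 'beta': 2}

df = pd.DataFrame({'timestep': [0, 1], 'x': [1.0, 1.5]})

# Broadcast each parameter into the results frame, prefixed with 'param_'.
df = df.assign(**{f'param_{k}': v for k, v in params.items()})
```

The prefix makes the control-parameter columns easy to pick out from state variables and KPIs later.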

jshorish commented 10 months ago

Just to add to this as a running list of compatibility requirements, also required for the sensitivity analysis workflow is:

  1. An unambiguous list of the exact column labels for each sweep parameter in the cadCAD scenario results df; and
  2. An unambiguous list of the exact column labels for the computed KPIs in the df.

Both of these lists are used to pick out the right control (sweep) parameters and KPIs to be used for the threshold inequalities in the sensitivity analysis exercise.
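A minimal sketch of how such explicit label lists might be consumed downstream, assuming the `param_` prefix convention and a hypothetical KPI column name (both are assumptions, not the codebase's actual labels):

```python
import pandas as pd

# Assumed 'single source of truth' lists of exact column labels.
sweep_param_cols = ['param_alpha', 'param_beta']  # control (sweep) parameters
kpi_cols = ['kpi_revenue']                        # computed KPIs

df = pd.DataFrame({
    'param_alpha': [0.1, 0.9],
    'param_beta':  [2, 3],
    'kpi_revenue': [100.0, 250.0],
    'timestep':    [10, 10],
})

# Select the control parameters as features and apply a threshold
# inequality to the KPI, as in the sensitivity analysis exercise.
X = df[sweep_param_cols]
y = df[kpi_cols[0]] > 150.0  # boolean: does the run clear the threshold?
```

With unambiguous lists, the sensitivity workflow never has to guess which columns are parameters versus KPIs.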

[If there's a location in the codebase where these column labels are already defined then that's great, and this can even be exploited in the future to automatically pull them from that 'single source of truth' into the sensitivity analysis workflow.]