Closed michaellevy closed 6 years ago
I could imagine an awesome shiny app sitting on top of this that lets the user select variables to split predictions across.
From @taylorlarsen
I’d be interested to chat through which “patient” or “observation” we run the counterfactuals for. The average patient, the median patient, a real individual that we perturb values for? Keeping in mind that it would be nice for both model explanation and individual patient/provider conversations (maybe we only allow certain scenarios and have limitations on individual patients).
That's a great point about counterfactual predictions being useful for both model-level interpretation and "what if this patient were ten pounds lighter?" type questions (which is similar to pip, but a slightly different angle on the same thing).
I propose that the default selects a handful of the most-important variables and makes predictions across those, holding the others at their medians. Three levels of customization could be available to the user: 1. Choosing the number of variables, 2. Choosing which variables, or 3. Choosing which values of those variables to use. That's all model-level. For user-level, the same three levels of customization could be available, but instead of using the medians of the other variables, the user could provide an identifier value, and we'd use that observation's values for all the non-varying variables.
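A minimal base-R sketch of that proposal, to make the two modes concrete. Everything here is hypothetical: the toy data, the column names, and `make_counterfactuals()` are illustrations, not existing package functions. The resulting grid is what would then be handed to the model for predictions.

```r
# Hypothetical toy data standing in for the training data.
d <- data.frame(
  patient_id = 1:6,
  weight     = c(150, 180, 165, 200, 175, 190),
  age        = c(30, 45, 52, 61, 38, 47),
  glucose    = c(90, 110, 105, 140, 95, 120)
)
predictors <- c("weight", "age", "glucose")

# `vary` is a named list of values to predict across. Every other column is
# held at its median (model-level), or at the values of a single observation
# when `reference_row` is supplied (user-level).
make_counterfactuals <- function(data, vary, reference_row = NULL) {
  static_names <- setdiff(names(data), names(vary))
  static <- if (is.null(reference_row)) {
    as.data.frame(lapply(data[static_names], median))
  } else {
    reference_row[static_names]
  }
  grid <- expand.grid(vary, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
  # Repeat the single row of static values once per row of the varying grid.
  out <- cbind(grid, static[rep(1, nrow(grid)), , drop = FALSE])
  rownames(out) <- NULL
  out
}

# Model-level: vary weight, hold age and glucose at their medians.
model_level <- make_counterfactuals(d[predictors],
                                    vary = list(weight = c(160, 180, 200)))

# User-level: same idea, but hold the other variables at patient 4's values.
patient_level <- make_counterfactuals(d[predictors],
                                      vary = list(weight = c(190, 200)),
                                      reference_row = d[d$patient_id == 4, ])
```

This sketch assumes all held columns are numeric; categorical columns would need a mode rather than a median.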
choose_variables will select the names of most-important variables to use, if needed
To do:
choose_values to choose the values of selected variables
trainingData. Unless we just say that we're going to use all the levels of the variable, in which case the user can easily filter later if they want. That seems a lot easier. Then it's just using the subset of attributes(rec)$factor_levels that are predictors. Maybe, but we have to find the modal value anyway for non-varying categories, so maybe it's worth reconstructing the column.
add_static_columns to choose the values of non-varying variables and add them to the varying DF.
hold should take either a list of functions or a row of the training data frame (or a list/data frame of values to use for the held variables), rather than an integer index, because this function is far away from the training data, so it makes sense to force the user to go back to the training data, à la filter(pima_diabetes, patient_id == 14)
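To show how the hold argument described above might accept both forms, here is a rough base-R sketch. The helper name `get_static_values()`, the `numerics`/`characters` names in the function list, and the toy data are all assumptions made for illustration; base subsetting is used in place of the dplyr filter() call so the example has no dependencies.

```r
# Hypothetical toy training data.
training <- data.frame(
  patient_id   = 1:5,
  age          = c(25, 40, 33, 51, 40),
  weight_class = c("obese", "normal", "obese", "overweight", "obese"),
  stringsAsFactors = FALSE
)

# Most common value; ties go to the value seen first.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# `hold` is either a list of functions (assumed here to be named `numerics`
# and `characters`) or a one-row data frame / named list of values.
get_static_values <- function(training, static_names, hold) {
  is_fun_list <- is.list(hold) && !is.data.frame(hold) &&
    all(vapply(hold, is.function, logical(1)))
  if (is_fun_list) {
    # Summarize each held column with the function matching its type.
    vals <- lapply(training[static_names], function(col) {
      if (is.numeric(col)) hold$numerics(col) else hold$characters(col)
    })
  } else {
    # Use the supplied values directly.
    stopifnot(all(static_names %in% names(hold)))
    vals <- as.list(hold)[static_names]
  }
  as.data.frame(vals, stringsAsFactors = FALSE)
}

static_names <- c("age", "weight_class")

# Hold at summaries of the training data.
by_function <- get_static_values(training, static_names,
                                 hold = list(numerics = median, characters = Mode))

# Hold at one patient's values, selected from the training data by the user.
by_row <- get_static_values(training, static_names,
                            hold = training[training$patient_id == 4, ])
```

Taking a row rather than an integer index keeps the lookup explicit: the user selects the observation from the training data themselves.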
stats::simulate
Need a better name. Let the user specify any number of columns and generate predictions from the best model across values for those columns (all levels of factors, maybe 5th to 95th percentile of numerics) and at mean/mode for other columns.
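As a sketch of the value selection described here (all levels for factors, 5th to 95th percentile for numerics), something like the following could work. The function name `choose_values_sketch()`, the default of five evenly spaced points, and the toy data are assumptions, not settled design.

```r
# Pick the values to predict across for one column.
choose_values_sketch <- function(x, n = 5) {
  if (is.numeric(x)) {
    # n evenly spaced points between the 5th and 95th percentiles.
    lims <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE, names = FALSE)
    seq(lims[1], lims[2], length.out = n)
  } else {
    # Every observed level of a categorical variable.
    sort(unique(as.character(x)))
  }
}

# Hypothetical toy data.
d <- data.frame(
  age = 0:100,
  sex = rep(c("F", "M"), length.out = 101),
  stringsAsFactors = FALSE
)

values <- lapply(d[c("age", "sex")], choose_values_sketch)

# All combinations of the chosen values: one row per prediction to make.
grid <- expand.grid(values, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
```

Columns not being varied would then be filled in at their mean or mode before predicting.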
Could have a plot method for the output that puts the changing variables on x, color, facet depending on type.