predict_counterfactual function

michaellevy commented 6 years ago

Need a better name. Let the user specify any number of columns and generate predictions from the best model across values for those columns (all levels of factors, maybe 5th to 95th percentile of numerics) and at mean/mode for other columns.
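
A minimal base-R sketch of the grid described above, under stated assumptions: `counterfactual_grid`, its arguments, and the toy data frame are hypothetical names for illustration, not part of the package. Non-numeric varying columns contribute every level, numeric ones an evenly spaced sequence from the 5th to the 95th percentile, and all other columns are held at their mean or mode.

```r
# Hypothetical helper; the name and arguments are illustrative, not package API.
counterfactual_grid <- function(d, vary, n_numeric = 5) {
  # Values to sweep for each varying column
  varying <- lapply(d[vary], function(x) {
    if (is.numeric(x)) {
      # 5th to 95th percentile, evenly spaced
      lims <- quantile(x, c(0.05, 0.95), na.rm = TRUE)
      seq(lims[[1]], lims[[2]], length.out = n_numeric)
    } else {
      levels(as.factor(x))  # every level of a categorical
    }
  })
  # One row per combination of the varying values
  grid <- expand.grid(varying, stringsAsFactors = FALSE)

  # Hold every other column at its mean (numeric) or mode (otherwise)
  for (col in setdiff(names(d), vary)) {
    x <- d[[col]]
    grid[[col]] <- if (is.numeric(x)) {
      mean(x, na.rm = TRUE)
    } else {
      names(which.max(table(x)))
    }
  }
  grid
}

# Toy data, made up for the example
d <- data.frame(
  age = c(30, 40, 50, 60, 70),
  bmi = c(22, 25, 31, 28, 35),
  smoker = factor(c("no", "yes", "no", "no", "yes"))
)
grid <- counterfactual_grid(d, vary = c("age", "smoker"))
```

The resulting data frame could then be passed to whatever prediction function the best model uses; that step is left out here because its interface is not specified in this thread.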

Could have a plot method for the output that maps the varying variables to x, color, and facet depending on their type.

michaellevy commented 6 years ago

I could imagine an awesome shiny app sitting on top of this that lets the user select variables to split predictions across.

michaellevy commented 6 years ago

From @taylorlarsen

I’d be interested to chat through which “patient” or “observation” we run the counterfactuals for: the average patient, the median patient, or a real individual whose values we perturb? Keep in mind that it would be nice to support both model explanation and individual patient/provider conversations (maybe we only allow certain scenarios and have limitations on individual patients).

That's a great point about counterfactual predictions being useful for both model-level interpretation and "what if this patient were ten pounds lighter?" type questions (which is similar to pip, but a slightly different angle on the same thing).

I propose that the default selects a handful of the most important variables and makes predictions across them while holding the others at their medians. Three levels of customization could be available to the user: 1. choosing the number of variables, 2. choosing which variables, or 3. choosing which values of those variables to use. That's all model-level. For individual-level predictions, the same three levels of customization could be available, but instead of using the medians of the other variables, the user could provide an identifier value, and we'd use that observation's values for all the non-varying variables.
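
The first two customization levels could be sketched roughly as follows. Everything here is an assumption for illustration: `pick_varying` is a hypothetical name, and `importance` stands in for whatever variable-importance scores the best model exposes.

```r
# Hypothetical sketch; none of these names are package API.
pick_varying <- function(importance, n = 2, variables = NULL) {
  if (!is.null(variables)) {
    # Level 2: the user names the variables directly
    unknown <- setdiff(variables, names(importance))
    if (length(unknown) > 0) {
      stop("Not predictors in the model: ", paste(unknown, collapse = ", "))
    }
    return(variables)
  }
  # Default and level 1: the n most important variables
  ranked <- names(sort(importance, decreasing = TRUE))
  head(ranked, n)
}

# Made-up importance scores for the example
importance <- c(age = 0.12, bmi = 0.41, glucose = 0.35, smoker = 0.05)

default_vars <- pick_varying(importance)                        # "bmi" "glucose"
top_three    <- pick_varying(importance, n = 3)                 # "bmi" "glucose" "age"
chosen       <- pick_varying(importance, variables = "smoker")  # "smoker"
```

The third level, choosing the values themselves, would amount to letting the user pass a named list of values in place of the automatically generated ones.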

michaellevy commented 6 years ago

choose_variables will select the names of the most important variables to use, if needed.

To do:

[x] write choose_values to choose the values of selected variables
- The hard part here will be recreating categoricals from the dummies in trainingData, unless we just use all the levels of each variable, in which case the user can easily filter later if they want. That seems a lot easier: it's then just the subset of attributes(rec)$factor_levels that are predictors. But we have to find the modal value anyway for non-varying categoricals, so maybe it's worth reconstructing the column.
[x] add_static_columns to choose the values of non-varying variables and add them to the varying DF.
- I think hold should take either a list of functions or a row of the training data frame (or a list/data frame of values to use for the held variables) rather than an integer index. This function is far away from the training data, so it makes sense to force the user to go back to the training data, à la filter(pima_diabetes, patient_id == 14).
[x] Make predictions, class the DF, etc.
[x] Unit tests for all these functions.
[ ] rename to avoid conflict with stats::simulate
[ ] Improve test coverage; the uncovered lines are almost entirely error conditions that the tests never trigger
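
To make the `hold` idea from the to-do list concrete, here is a rough base-R sketch in which `hold` is either a list of summary functions or a single row pulled from the training data. The names `add_static_sketch` and `mode_value`, the `numerics`/`characters` list elements, and the toy data are all assumptions; the real add_static_columns may be written quite differently.

```r
# Hypothetical sketch; the real add_static_columns signature may differ.
mode_value <- function(x) names(which.max(table(x)))

add_static_sketch <- function(varying, d,
                              hold = list(numerics = median,
                                          characters = mode_value)) {
  for (col in setdiff(names(d), names(varying))) {
    varying[[col]] <- if (is.data.frame(hold)) {
      # Individual-level: copy the chosen observation's value
      stopifnot(nrow(hold) == 1, col %in% names(hold))
      hold[[col]]
    } else if (is.numeric(d[[col]])) {
      hold$numerics(d[[col]])    # model-level: summarize numerics
    } else {
      hold$characters(d[[col]])  # model-level: summarize categoricals
    }
  }
  varying
}

# Toy data, made up for the example
d <- data.frame(
  patient_id = 1:5,
  age = c(30, 40, 50, 60, 70),
  bmi = c(22, 25, 31, 28, 35),
  smoker = c("no", "yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)
predictors <- d[c("age", "bmi", "smoker")]
varying <- data.frame(age = c(35, 45, 55))

# Model-level: other columns held at median / mode
model_level <- add_static_sketch(varying, predictors)

# Individual-level: the user goes back to the training data for one row
patient_row <- d[d$patient_id == 5, ]
patient_level <- add_static_sketch(varying, predictors, hold = patient_row)
```

Taking a row rather than an integer index keeps the lookup explicit on the user's side, which is the design argument made in the list above.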

HealthCatalyst / healthcareai-r

predict_counterfactual function #881