HealthCatalyst / healthcareai-r

R tools for healthcare machine learning
https://docs.healthcare.ai

Function to identify factor levels most likely to be informative #1078

Closed michaellevy closed 6 years ago

michaellevy commented 6 years ago

We're often faced with categorical variables whose cardinality is high enough that we cannot use all the levels due to computational limits. One solution is to keep the N most common levels, but the most common levels are unlikely to be the most predictive.

This function would take two tables: one with the factor column and an ID, the other with the outcome column and an ID. They are separate tables because the factor usually has multiple observations per unit (e.g. multiple meds per patient), though it would be nice to alternatively accept both columns in the same table. It would then "loop" over the factor levels and compute the variance of the outcome within each level. Levels associated with low variance in the outcome should be strong predictors. It could return the user-specified N best-splitting levels, or a data.frame with levels in one column and variance in the other, so the user could plot variance and see where it drops off.
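The idea above can be sketched roughly as follows. This is an illustration in plain Python, not the package's R implementation; the function name `level_outcome_variance`, the `min_n` cutoff, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import pvariance


def level_outcome_variance(factor_rows, outcome_rows, min_n=2):
    """Rank factor levels by the outcome variance among units having that level.

    factor_rows: iterable of (unit_id, level) pairs; a unit may appear many times.
    outcome_rows: iterable of (unit_id, outcome) pairs; one numeric outcome per unit.
    min_n: hypothetical cutoff; levels seen in fewer units are skipped, since a
           level seen once has zero variance by construction.
    """
    outcome_by_id = dict(outcome_rows)

    # Collect each unit at most once per level so repeated rows don't double count.
    ids_by_level = defaultdict(set)
    for unit_id, level in factor_rows:
        if unit_id in outcome_by_id:
            ids_by_level[level].add(unit_id)

    results = []
    for level, ids in ids_by_level.items():
        if len(ids) < min_n:
            continue
        outcomes = [outcome_by_id[i] for i in ids]
        results.append(
            {"level": level, "n": len(ids), "variance": pvariance(outcomes)}
        )

    # Lowest variance first; ties go to the level seen in more units.
    return sorted(results, key=lambda r: (r["variance"], -r["n"]))


# Made-up toy data: meds per patient, and whether each patient died (1) or not (0).
meds = [(1, "aspirin"), (1, "dexamethasone"), (2, "aspirin"), (3, "aspirin"),
        (3, "aspirin"), (4, "dexamethasone"), (5, "dexamethasone"), (5, "insulin")]
died = [(1, 1), (2, 0), (3, 0), (4, 1), (5, 1)]
ranked = level_outcome_variance(meds, died)
```

Returning the full ranked list rather than only the top N keeps both options from the proposal open: the caller can slice off the first N entries or plot `variance` against rank to look for a drop-off.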

michaellevy commented 6 years ago

I need to do this as a one-off for HPH HealthCatalyst/ml.internal/issues/1057 anyway, so might as well take the opportunity to productize it.

michaellevy commented 6 years ago

Thought about creating a numeric feature that is the average outcome for each grouper, but that only works if each grain has only one grouper. E.g., if we want to bring meds info to a patient-level prediction of mortality, we could calculate the proportion of patients getting aspirin that died, but what's the value for a patient who got aspirin and dexamethasone? Could do summary numeric columns: mean med survival, min med survival, max med survival, number meds. Or could go back to finding the meds that are most strongly correlated with death and survival and use those as dummies.
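A minimal sketch of the summary-column option, again in plain Python rather than R. The function name `med_survival_features`, the feature names, and the toy data are assumptions for illustration. Note that computing each medication's survival rate on the same patients being scored leaks the outcome into the features, so in practice the rates would need to come from training data only.

```python
from collections import defaultdict
from statistics import mean


def med_survival_features(med_rows, survived_by_patient):
    """Summarise per-medication survival rates into one row per patient.

    med_rows: iterable of (patient_id, med) pairs.
    survived_by_patient: dict of patient_id -> 1 (survived) or 0 (died);
                         assumed to contain every patient in med_rows.
    """
    patients_by_med = defaultdict(set)
    meds_by_patient = defaultdict(set)
    for patient_id, med in med_rows:
        patients_by_med[med].add(patient_id)
        meds_by_patient[patient_id].add(med)

    # Proportion of patients on each medication who survived.
    survival_by_med = {
        med: mean(survived_by_patient[p] for p in patients)
        for med, patients in patients_by_med.items()
    }

    features = {}
    for patient_id, meds in meds_by_patient.items():
        rates = [survival_by_med[m] for m in meds]
        features[patient_id] = {
            "mean_med_survival": mean(rates),
            "min_med_survival": min(rates),
            "max_med_survival": max(rates),
            "n_meds": len(meds),
        }
    return features


# Made-up toy data: patient 1 got both meds and died; patient 2 survived.
med_rows = [(1, "aspirin"), (1, "dexamethasone"), (2, "aspirin"),
            (3, "dexamethasone")]
survived = {1: 0, 2: 1, 3: 0}
feats = med_survival_features(med_rows, survived)
```

Here the patient who got both aspirin and dexamethasone ends up with a mean, min, and max over the two medication-level rates, which sidesteps the question of picking a single value.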

michaellevy commented 6 years ago

For classification, I'm giving equal weight to the log distance from being present in every record and the log distance from perfectly separating outcomes (they're multiplied in a line that assigns the "badness" variable). Could play with weighting them differently and testing performance.