HealthCatalyst / healthcareai-r

R tools for healthcare machine learning
https://docs.healthcare.ai

get_variable_importance bug #1098

Closed michaellevy closed 6 years ago

michaellevy commented 6 years ago
library(healthcareai)
# pregnancies is numeric, so this trains regression models
m <- machine_learn(pima_diabetes, patient_id, outcome = pregnancies)
get_variable_importance(m)

# A tibble: 14 x 2
   variable                importance
 * <chr>                        <dbl>
 1 plasma_glucose                 NaN
 2 diastolic_bp                   NaN
 3 skinfold                       NaN
 4 insulin                        NaN
 5 pedigree                       NaN
 6 age                            NaN
 7 weight_class_normal            NaN
 8 weight_class_obese             NaN
 9 weight_class_overweight        NaN
10 weight_class_other             NaN
11 weight_class_missing           NaN
12 diabetes_Y                     NaN
13 diabetes_other                 NaN
14 diabetes_missing               NaN

michaellevy commented 6 years ago

The NaN importances only appear when the splitrule is maxstat, which is only available for regression. We could switch to permutation importance (I remember choosing impurity intentionally, but not why), or we could restrict the regression splitrule to extratrees or variance. Waiting to hear back here before making changes.
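For reference, a minimal sketch of the permutation-importance option, calling ranger directly rather than going through healthcareai. It assumes the ranger package is installed and uses the built-in mtcars data purely as a stand-in; the seed and num.trees values are arbitrary. Permutation importance is computed from out-of-bag predictions rather than from the split criterion, so it should be defined for all three regression split rules.

```r
library(ranger)

set.seed(2018)

# Fit one regression forest per split rule, requesting permutation
# importance instead of impurity importance. mtcars is only a stand-in.
splitrules <- c("variance", "extratrees", "maxstat")
fits <- lapply(splitrules, function(rule) {
  ranger(mpg ~ ., data = mtcars, num.trees = 200,
         splitrule = rule, importance = "permutation")
})
names(fits) <- splitrules

# One column of importances per split rule; rows are predictors.
imp <- sapply(fits, function(fit) fit$variable.importance)
print(round(imp, 2))
```

Whether importance = "impurity" returns NaN for maxstat depends on the installed ranger version, so this sketch does not try to reproduce the bug itself.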

michaellevy commented 6 years ago

Impurity importance for maxstat splitting is just now being added to ranger. In my experience maxstat is far less effective than variance or extratrees, and tuning mtry is more important, so I'm just going to take out maxstat splitting.
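A rough illustration of what tuning looks like once maxstat is dropped: a small grid over mtry and the two remaining split rules, compared by out-of-bag error. This calls ranger directly and is not healthcareai's tuning code; the grid values, seed, and the mtcars stand-in data are assumptions made for the example.

```r
library(ranger)

set.seed(2018)

# Hypothetical grid: mtry values crossed with the two remaining split rules.
grid <- expand.grid(mtry = c(2, 4, 6, 8),
                    splitrule = c("variance", "extratrees"),
                    stringsAsFactors = FALSE)

# Out-of-bag mean squared error for each combination.
grid$oob_mse <- mapply(function(m, rule) {
  ranger(mpg ~ ., data = mtcars, num.trees = 200,
         mtry = m, splitrule = rule)$prediction.error
}, grid$mtry, grid$splitrule)

# Best combinations first.
print(grid[order(grid$oob_mse), ])
```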