HealthCatalyst / healthcareai-r

R tools for healthcare machine learning
https://docs.healthcare.ai

AUPR is wrong in summary #1235

Closed michaellevy closed 6 years ago

michaellevy commented 6 years ago

AUROC is fine, but AUPR is wrong in summary.model_list and its dependencies, which include plot.model_list and evaluate.model_list.

I don't know what the numbers are. They're not AUROC, and they seem too high (> .9 on hard prediction tasks) to be from final-model predictions on the training data.

library(healthcareai)
pd <- prep_data(pima_diabetes, patient_id, outcome = diabetes) 
roc_models <- tune_models(pd, diabetes, tune_depth = 2, models = "RF")
pr_models <- tune_models(pd, diabetes, tune_depth = 2, models = "RF", metric = "PR")

# AUROC = .83 in both print and summary.
roc_models
#> Algorithms Trained: Random Forest
#> Model Name: diabetes
#> Target: diabetes
#> Class: Classification
#> Performance Metric: AUROC
#> Number of Observations: 768
#> Number of Features: 12
#> Models Trained: 2018-08-24 18:52:29 
#> 
#> Models tuned via 5-fold cross validation over 2 combinations of hyperparameter values.
#> Best model: Random Forest
#> AUPR = 0.69, AUROC = 0.83
#> Optimal hyperparameter values:
#>   mtry = 6
#>   splitrule = extratrees
#>   min.node.size = 3
summary(roc_models)
#> Models trained: 2018-08-24 18:52:29
#> 
#> Models tuned via 5-fold cross validation over 2 combinations of hyperparameter values.
#> Best performance: AUPR = 0.69, AUROC = 0.83
#> By Random Forest with hyperparameters:
#>   mtry = 6
#>   splitrule = extratrees
#>   min.node.size = 3
#> 
#> Out-of-fold performance of all trained models:
#> 
#> $`Random Forest`
#> # A tibble: 2 x 9
#>    mtry splitrule  min.node.size AUROC  Sens  Spec  ROCSD SensSD SpecSD
#> * <int> <chr>              <int> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
#> 1     6 extratrees             3 0.833 0.844 0.597 0.0273 0.0365 0.0643
#> 2     1 gini                  13 0.830 0.958 0.347 0.0406 0.0217 0.0844

# AUPR = .7 in print but .9 in summary's out-of-fold table.
pr_models
#> Algorithms Trained: Random Forest
#> Model Name: diabetes
#> Target: diabetes
#> Class: Classification
#> Performance Metric: AUPR
#> Number of Observations: 768
#> Number of Features: 12
#> Models Trained: 2018-08-24 18:52:32 
#> 
#> Models tuned via 5-fold cross validation over 2 combinations of hyperparameter values.
#> Best model: Random Forest
#> AUPR = 0.69, AUROC = 0.83
#> Optimal hyperparameter values:
#>   mtry = 4
#>   splitrule = gini
#>   min.node.size = 8
summary(pr_models)
#> Models trained: 2018-08-24 18:52:32
#> 
#> Models tuned via 5-fold cross validation over 2 combinations of hyperparameter values.
#> Best performance: AUPR = 0.69, AUROC = 0.83
#> By Random Forest with hyperparameters:
#>   mtry = 4
#>   splitrule = gini
#>   min.node.size = 8
#> 
#> Out-of-fold performance of all trained models:
#> 
#> $`Random Forest`
#> # A tibble: 2 x 11
#>    mtry splitrule min.node.size  AUPR Precision Recall     F  AUCSD
#> * <int> <chr>             <int> <dbl>     <dbl>  <dbl> <dbl>  <dbl>
#> 1     4 gini                  8 0.889     0.787  0.848 0.816 0.0146
#> 2    10 gini                 12 0.880     0.795  0.838 0.816 0.0183
#> # ... with 3 more variables: PrecisionSD <dbl>, RecallSD <dbl>, FSD <dbl>

Created on 2018-08-24 by the reprex package (v0.2.0).

michaellevy commented 6 years ago

Actually, evaluate is fine because we calculate the performance metrics ourselves: for evaluate.model_list, they are computed at model_list creation and attached as attr(model_list, "performance").
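
For illustration, the values evaluate.model_list returns can be read straight off the model_list (using the pr_models object from the reprex above):

# The metrics healthcareai computes itself at model_list creation and attaches
# to the model_list; evaluate() reads these, so it is unaffected by the bug.
attr(pr_models, "performance")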

Whatever the issue is, it seems to be in caret's calculation of AUPR. We go to great lengths to calculate our training metrics exactly as caret does (see the nested map calls in the middle of evaluate.model_list), and ours and caret's match for all three regression metrics and for AUROC; it's only AUPR where caret's numbers are ludicrously high.

I propose we issue a warning about this if the user calls summary.model_list or plot.model_list on a PR-optimized model. Not sure what else we can do. @mmastand @taylorlarsen ?
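
Something along these lines, purely as a sketch of the idea (the function name and the metric lookup are assumptions, not actual healthcareai internals):

# Hypothetical helper, not package code: warn when summarizing or plotting a
# model_list that was tuned on AUPR, since caret's stored AUPR is suspect.
warn_if_pr_optimized <- function(model_list) {
  metric <- attr(model_list, "metric")  # assumed attribute name
  if (identical(metric, "PR"))
    warning("Out-of-fold AUPR values come from caret and appear inflated; ",
            "use evaluate() for a reliable AUPR.", call. = FALSE)
}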

Here is the raw caret result vs ours. Note that caret calls AUPR "AUC" (which we later change, but this is straight from caret):

> pr_models$`Random Forest`$results[1, ]
  mtry splitrule min.node.size       AUC Precision Recall         F      AUCSD PrecisionSD   RecallSD        FSD
1    1      gini             4 0.8955753 0.7356974  0.956 0.8311848 0.02265933   0.0405445 0.01516575 0.03087087
> evaluate(pr_models)
     AUPR     AUROC 
0.7174850 0.8461132 
michaellevy commented 6 years ago

print.predicted_df prints the wrong performance metric too. It gets it from attr(predicted_df, "model_info")$performance, which gets attached in predict and comes from extract_model_info(model_list)$best_model_perf. extract_model_info gets that from model_list[[1]]$results, which, as noted in the comment above, is where caret has the crazy AUPR values. So in extract_model_info, instead of taking best_model_perf from best_metrics[[best_model]], we could take it from evaluate(x). Yes, doing that.
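
As a rough sketch of that swap inside extract_model_info (the before line uses the names from this comment; the after line assumes evaluate(x) returns the named metric vector shown in the previous comment):

# Before: best_model_perf comes from caret's stored results, which carry the
# inflated AUPR.
# best_model_perf <- best_metrics[[best_model]]

# After: take it from our own cross-validated calculation instead.
best_model_perf <- evaluate(x)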

michaellevy commented 6 years ago

Okay, print.predicted_df is fixed. But the summary.model_list and plot.model_list issue remains.

michaellevy commented 6 years ago

I just figured this out. It's that caret uses the reference level of the outcome as the positive class, but we do the opposite. Should be a relatively straightforward fix now.

library(caret)
library(MLmetrics)
library(tidyverse)
set.seed(2695)
data(mtcars)
x <- select(mtcars, -am)
# The factor-levels order here is key. Flip the levels and you get the wrong
# result that our package currently produces
y <- factor(ifelse(mtcars$am, "Y", "N"), levels = c("Y", "N"))
tc <- trainControl(method = "cv",
                   number = 3, 
                   classProbs = TRUE,
                   summaryFunction = prSummary,
                   search = "random",
                   savePredictions = "final")
tr <- train(x = x,
            y = y,
            method = "ranger",
            metric = "AUC",
            trControl = tc)

# caret's AUPR
tr$results$AUC[1]
#> [1] 0.7666667

# Manually calculated AUPR
tr$pred %>% 
  group_by(Resample) %>% 
  summarize(aupr = PRAUC(y_pred = Y, y_true = ifelse(obs == "Y", 1, 0))) %>% 
  pull(aupr) %>% 
  mean()
#> [1] 0.7666667

Created on 2018-08-27 by the reprex package (v0.2.0).

michaellevy commented 6 years ago

Tricky thing about making the outcome's positive class the reference level: glmnet assumes the opposite, so the coefficients from a glmnet model trained by caret have flipped signs. That's caret's problem, not ours, so I think I'll just make the hacky fix of flipping the signs of the coefficients for classification models in interpret.
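
As an illustrative sketch of that sign flip (the object names here are assumptions, not interpret's actual internals):

# Hypothetical sketch, not interpret()'s real code: glmnet's coefficients are
# oriented toward the reference level, so negate them for classification
# models to make them describe the positive class.
coefs <- coef(glmnet_fit, s = lambda)   # glmnet_fit, lambda: assumed names
if (is_classification) coefs <- -coefs  # is_classification: assumed flag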