michaellevy closed this issue 6 years ago.
Actually `evaluate` is fine because we calculate the performance metrics ourselves for `evaluate`. For `evaluate.model_list` we actually calculate them at `model_list` creation and attach them as `attr(model_list, "performance")`.
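For illustration, a minimal sketch of that pattern (the helper names here are made up, not the package's actual functions):

```r
# Sketch of the pattern described above: metrics are computed once, at
# model_list creation, stashed as an attribute, and evaluate.model_list just
# reads them back. Function names are hypothetical.
attach_performance <- function(model_list, metrics) {
  attr(model_list, "performance") <- metrics
  model_list
}

get_performance <- function(model_list) {
  attr(model_list, "performance")
}
```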
Whatever the issue is, it seems to be in caret's calculation of AUPR. We go to great lengths to calculate our training metrics exactly as caret does (see the nested `map` calls in the middle of `evaluate.model_list`), and ours and caret's match for all three regression metrics and AUROC; it's just AUPR, and caret's numbers are ludicrously high.
I propose we issue a warning about this if the user calls `summary.model_list` or `plot.model_list` on a PR-optimized model. Not sure what else we can do. @mmastand @taylorlarsen ?
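Rough sketch of what that warning could look like (how we detect a PR-optimized model here is an assumption; whatever we actually store would go in its place):

```r
# Hypothetical sketch of the proposed warning. Assumes the tuning metric can
# be recovered from the model_list; the attribute name "metric" is made up.
warn_if_pr_optimized <- function(model_list) {
  metric <- attr(model_list, "metric")
  if (!is.null(metric) && metric == "PR")
    warning("This model was tuned on AUPR, and caret's reported AUPR is inflated. ",
            "Use evaluate() for reliable out-of-fold AUPR.", call. = FALSE)
  invisible(model_list)
}
```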
Here is the raw caret result vs ours. Note that caret calls AUPR "AUC" (which we later change, but this is straight from caret):
```r
> pr_models$`Random Forest`$results[1, ]
  mtry splitrule min.node.size       AUC Precision Recall         F      AUCSD PrecisionSD   RecallSD        FSD
1    1      gini             4 0.8955753 0.7356974  0.956 0.8311848 0.02265933   0.0405445 0.01516575 0.03087087
> evaluate(pr_models)
     AUPR     AUROC
0.7174850 0.8461132
```
`print.predicted_df` prints the wrong performance metric too. It gets it from `attr(predicted_df, "model_info")$performance`, which gets attached in `predict` and comes from `extract_model_info(model_list)$best_model_perf`. `extract_model_info` gets that from `model_list[[1]]$results`, which -- as noted in the above comment -- is where caret has the crazy AUPR values. So, in `extract_model_info`, instead of taking `best_model_perf` from `best_metrics[[best_model]]`, we could take it from `evaluate(x)`. Yes, doing that.
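Roughly, the change inside `extract_model_info` would look like this (heavily simplified; the surrounding logic is elided):

```r
# Simplified sketch of the fix: pull best_model_perf from our own evaluate()
# rather than from caret's results table (model_list[[1]]$results).
extract_model_info_sketch <- function(x) {
  # ... existing logic that finds best_model, best_metrics, etc. ...
  # best_model_perf <- best_metrics[[best_model]]  # old: inherits caret's inflated AUPR
  best_model_perf <- evaluate(x)                   # new: our out-of-fold metrics
  best_model_perf
}
```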
Okay, `print.predicted_df` is fixed. But the `summary.model_list` and `plot.model_list` issue remains.
I just figured this out. It's that caret uses the reference level of the outcome as the positive class, but we do the opposite. Should be a relatively straightforward fix now.
```r
library(caret)
library(MLmetrics)
library(tidyverse)
set.seed(2695)
data(mtcars)
x <- select(mtcars, -am)
# The factor-levels order here is key. Flip the levels and you get the wrong
# result that our package currently produces
y <- factor(ifelse(mtcars$am, "Y", "N"), levels = c("Y", "N"))
tc <- trainControl(method = "cv",
                   number = 3,
                   classProbs = TRUE,
                   summaryFunction = prSummary,
                   search = "random",
                   savePredictions = "final")
tr <- train(x = x,
            y = y,
            method = "ranger",
            metric = "AUC",
            trControl = tc)
# caret's AUPR
tr$results$AUC[1]
#> [1] 0.7666667
# Manually calculated AUPR
tr$pred %>%
  group_by(Resample) %>%
  summarize(aupr = PRAUC(y_pred = Y, y_true = ifelse(obs == "Y", 1, 0))) %>%
  pull(aupr) %>%
  mean()
#> [1] 0.7666667
```
Created on 2018-08-27 by the reprex package (v0.2.0).
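The fix on our side then amounts to making the positive class the first (reference) level before the outcome reaches caret. A minimal sketch (the helper name is invented):

```r
# Hypothetical helper: put the positive class first so caret/prSummary treats
# it as the positive class, matching how we compute metrics in evaluate().
set_positive_reference <- function(outcome, positive_class) {
  stats::relevel(factor(outcome), ref = positive_class)
}

# With the reprex above:
# levels(set_positive_reference(y, "Y"))  #> "Y" "N"
```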
Tricky thing about making the outcome's positive class the reference level: `glmnet` assumes the opposite, so coefficients from a glmnet model trained by caret have their signs flipped. That's caret's problem, not ours, so I think I'll just make the hacky fix of flipping the signs of coefficients for classification models in `interpret`.
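i.e., something along these lines inside `interpret` (a sketch only; the coefficient object's structure here is an assumption):

```r
# Hypothetical sketch of the sign flip. Assumes a data frame of glmnet
# coefficients with a `coefficient` column and a flag for classification
# models; both are assumptions about interpret()'s internals.
flip_classification_coefficients <- function(coefs, is_classification) {
  if (is_classification)
    coefs$coefficient <- -coefs$coefficient
  coefs
}
```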
AUROC is fine, but AUPR is wrong in `summary.model_list` and its dependencies, which include `plot.model_list` and `evaluate.model_list`. Don't know what the numbers are. They're not AUROC. They seem too high (> .9 on hard prediction tasks) to be from final-model predictions on the training data.
Created on 2018-08-24 by the reprex package (v0.2.0).