Closed by geneorama 8 years ago
BTW, to see what I mean, check out branch iss91, run 30_glmnet_model.R to line 146, and then step through the code.
@geneorama there are only two lines of shared code between the two functions, and that code is not worth pulling out into a separate function.
I have no other comments. I looked over the code, but did not run it.
@cash good idea on the name change. After some more research I think "labels" might make even more sense, but positives is an improvement over pos.
Do you have any ideas for the other output? Not sure how to make this summary more intuitive:
Some places I looked for inspiration:
BTW, it's fun to play around with some metrics from the ROCR package.
##==============================================================================
## Metrics with ROCR Package
##==============================================================================
## computing a simple ROC curve (x-axis: fpr, y-axis: tpr)
geneorama::loadinstall_libraries("ROCR")
predTest <- prediction(datTest$score, datTest$criticalFound)
## precision / recall
plot(performance(predTest, "prec", "rec"), main="precision recall")
# ROC
plot(performance(predTest, "tpr", "fpr"), main="ROC")
abline(0, 1, lty=2)
## sensitivity / specificity
plot(performance(predTest, "sens", "spec"), main="sensitivity vs specificity")
abline(1, -1, lty=2)
## phi
plot(performance(predTest, "phi"), main="phi scores")
## Fancy ROC curve:
op <- par(bg="lightgray", mai=c(1.2,1.5,1,1))
plot(performance(predTest,"tpr","fpr"),
main="ROC Curve", colorize=TRUE, lwd=10)
par(op)
## Effect of using a cost function on cutoffs
plot(performance(predTest, "cost", cost.fp = 1, cost.fn = 1),
main="Even costs (FP=1 TN=1)")
plot(performance(predTest, "cost", cost.fp = 1, cost.fn = 4),
main="Higher cost for FN (FP=1 TN=4)")
## Accuracy
plot(performance(predTest, measure = "acc"))
## AUC
performance(predTest, measure = "auc")@y.values[[1]]
I'm coming from a machine learning background, so ROC curves and AUC are what I'm used to looking at. That fancy ROC curve did a really nice job of showing the distribution of the scores. I hadn't noticed the skew in that distribution before.
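In case it's useful, here is a small base-R sketch for looking at that skew directly, assuming the same datTest$score and datTest$criticalFound columns used in the snippet above:

```r
## Compare the score distributions for inspections with and without
## a critical violation found (same datTest as above)
dens_neg <- density(datTest$score[datTest$criticalFound == 0])
dens_pos <- density(datTest$score[datTest$criticalFound == 1])

plot(dens_neg, col = "blue", lwd = 2,
     xlim = range(datTest$score),
     ylim = range(dens_neg$y, dens_pos$y),
     main = "Score distribution by outcome", xlab = "score")
lines(dens_pos, col = "red", lwd = 2)
legend("topright", legend = c("criticalFound = 0", "criticalFound = 1"),
       col = c("blue", "red"), lwd = 2)
```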
This PR is enough of an improvement over what's in master that I recommend doing any cleanup that is needed (variable names, typos like "Caluclate") and then merging it in. Adding some nice plots could be an additional pull request.
@cash I was really hoping to get some feedback on the column headers, like POSTOT_SIM.
The name components are supposed to signal the following:
Component | Meaning
---|---
POS | Positives
TOT | Running total
SIM | Simulated results
But I think the current names are unintuitive and ugly, so any suggestions are welcome.
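One low-effort option would be to keep the computation as-is and just rename the output columns to something self-describing before returning or printing. A rough sketch; POSTOT_SIM is the only real column name shown in this thread, and the bin_summary object and the replacement name are placeholders for illustration:

```r
## Hypothetical sketch: rename a cryptic summary column to a readable name.
## `bin_summary` and the new name are placeholders, not the real objects.
names(bin_summary)[names(bin_summary) == "POSTOT_SIM"] <-
    "simulated_positives_running_total"
```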
We've accepted the pull request from @cash that does the model evaluation much more directly than the previous approach. However, I'd like to streamline it a bit more and make the functions even more general for use elsewhere. So, I've broken eval_model into two functions called simulated_date_diff_mean and simulated_bin_summary.
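For context, here is a purely illustrative sketch of the kind of calculation the name simulated_date_diff_mean is meant to suggest; the actual function on the branch is the authoritative version and may be organized quite differently, and the argument names below are placeholders:

```r
## Illustrative only: compare when positives would have been inspected under
## a score-ordered schedule versus the actual schedule.
## dates:     Date vector of actual inspection dates
## scores:    model scores for the same inspections
## positives: logical vector marking inspections with a critical violation
simulated_date_diff_mean_sketch <- function(dates, scores, positives) {
  ## reassign the existing inspection dates in order of descending score
  simulated_dates <- sort(dates)[rank(-scores, ties.method = "first")]
  ## mean number of days by which positives shift under the simulated order
  mean(as.numeric(dates[positives] - simulated_dates[positives]))
}
```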
I'm looking for comments in general, but specifically on the issue that simulated_bin_summary has obnoxious column titles. Any ideas for better names / labels? Feel free to comment on other problems too; these are just the things already on my mind.
Thanks!
Gene
Also tagging: @tomschenkjr @cash and @fgregg (if you have time / interest)