Benjamin-Lee / deep-rules

Ten Quick Tips for Deep Learning in Biology
https://benjamin-lee.github.io/deep-rules/

Evaluating/validating the algorithm/predictions #16

Closed SiminaB closed 5 years ago

SiminaB commented 5 years ago

The "Ten Quick Tips" cited in #1 has a number of rules related to this. As with other issues/potential questions, most of these are related to ML in general, but should at least be mentioned for DL. The rules related to this in "Ten Quick Tips" are:

I think we only need 1-2 rules related to this. One issue that has been coming up recently is that of bias in training data, e.g. https://arxiv.org/abs/1711.08536 and http://www.pnas.org/content/115/16/E3635.short. Not sure if that would fit in here, if we should have it as a separate rule relating to generalizability, or if it relates enough to #9 that it can just fit in there.

agitter commented 5 years ago

All of those bulleted rules above are really essential. Arguably even more essential than some of the DL-specific rules discussed here.

Perhaps one of the rules should be "don't forget all of the best practices for general ML in biomedicine" with a citation to #1 and some discussion.

SiminaB commented 5 years ago

Yep, I agree! Just wondering if there's anything DL-specific to add, like "be even more careful about overfitting." The bias issue is also not DL-specific, but it can get exacerbated as ML methods are applied to a growing number of datasets.

beamandrew commented 5 years ago

I think this is really worth expanding, especially in the context of potential healthcare applications. Many times the evaluation strategies are ported directly from the ML/CV communities, and they often don't make sense for bio/medical applications. Things like area under the precision-recall curve, positive predictive value, etc. are often left out. I think the big idea is that you should evaluate a deep learning model just like you would any other kind of statistical model. Using deep learning doesn't get you off the hook for using sound statistical methodology.
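
For concreteness, a minimal scikit-learn sketch of reporting these metrics; the labels and probabilities below are simulated placeholders, not outputs of any model discussed here:

```python
# Minimal sketch: report clinically relevant metrics for a binary classifier.
# `y_true` and `y_prob` are simulated placeholders; in practice they would be
# the held-out labels and the predicted probabilities from your model.
import numpy as np
from sklearn.metrics import average_precision_score, precision_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, 1000), 0, 1)
y_pred = (y_prob >= 0.5).astype(int)  # hard calls at a chosen operating threshold

print("Area under the PR curve:", average_precision_score(y_true, y_prob))
print("PPV (precision):", precision_score(y_true, y_pred))
print("Sensitivity (recall):", recall_score(y_true, y_pred))
```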

beamandrew commented 5 years ago

Adding to this, I would like to say that we should encourage people to cross-validate when possible. For many datasets this is not only possible but preferable to a traditional train/val/test split.
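
As an illustration, a sketch of stratified k-fold cross-validation; the dataset and estimator are toy placeholders (a suitably wrapped deep model could be scored the same way):

```python
# Sketch: stratified k-fold cross-validation instead of a single split.
# Toy data and a simple estimator stand in for a real dataset and model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="average_precision")
print(scores.mean(), scores.std())  # report the spread across folds, not one split
```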

hugoaerts commented 5 years ago

+1 on the first comment from Andrew.

About the cross-validation: yes, for training and tuning datasets. However, independent test datasets that are evaluated only once are mandatory for the highest level of evidence.
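
A sketch of how the two points can be combined, with placeholder data and hyperparameters: cross-validate for tuning on the training portion, then score a locked-away test set exactly once:

```python
# Sketch: tune with cross-validation on the training data only, then evaluate
# the chosen model a single time on an independent test set. All data and
# hyperparameter values here are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      cv=5, scoring="average_precision")
search.fit(X_train, y_train)         # tuning touches only the training data
print(search.score(X_test, y_test))  # the test set is used once, at the very end
```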

Benjamin-Lee commented 5 years ago

@gokceneraslan writes (via email): "Know your evaluation metrics: Don't use AUROC for imbalanced data, use AUPRC instead and report how imbalanced your data is."
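
To illustrate the point on toy data (a simulated ~1% positive class, not a real dataset):

```python
# Sketch: under heavy class imbalance, AUROC can look reassuring while AUPRC
# reveals how hard the problem actually is. Data and model are simulated.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("Positive fraction:", y_te.mean())              # always report the imbalance
print("AUROC:", roc_auc_score(y_te, prob))            # often deceptively high
print("AUPRC:", average_precision_score(y_te, prob))  # baseline equals the positive fraction
```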

michaelmhoffman commented 5 years ago

Precision-recall analysis is essential. But auPR is not a panacea. Like auROC, much of it derives from performance in regions of the curve that matter little. In other words, who cares what the recall is when precision is <50%? That regime is unrealistic in most deployment scenarios.

In the end, if you try to boil everything down to one number you're in trouble. See our Virtual ChIP-seq paper for an example of reporting performance with many different metrics, such as F1, accuracy, MCC, auROC, and auPR. Recall at fixed precision (e.g. 90% precision or 95% precision) can be very useful too.
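
A sketch of such a multi-metric report, including recall at a fixed precision; the labels and scores below are simulated stand-ins:

```python
# Sketch: report a panel of metrics plus recall at a fixed precision rather
# than a single summary number. `y_true` and `y_prob` are simulated.
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             matthews_corrcoef, precision_recall_curve,
                             roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)
y_prob = np.clip(y_true * 0.5 + rng.normal(0.3, 0.25, 2000), 0, 1)
y_pred = (y_prob >= 0.5).astype(int)

print("F1:", f1_score(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("auROC:", roc_auc_score(y_true, y_prob))
print("auPR:", average_precision_score(y_true, y_prob))

precision, recall, _ = precision_recall_curve(y_true, y_prob)
print("Recall at >=90% precision:", recall[precision >= 0.90].max())
```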

rasbt commented 5 years ago

> Precision-recall analysis is essential. But auPR is not a panacea. Like auROC, much of it derives from performance in regions of the curve that matter little. In other words, who cares what the recall is when precision is <50%? That regime is unrealistic in most deployment scenarios.

Totally agree with that.

(However, this is basically a general data-mining topic. While the rules so far sound great, I think we should try to steer more towards deep learning and biology, given the topic of the paper.)

In any case, another related point is a) model comparison and b) algorithm comparison (since we are usually interested in determining how much better DL does compared to other methods). For algorithm comparisons, both the 5x2cv F test and nested cross-validation are worth mentioning, along with how these are challenged by DL's computational cost (maybe recommend McNemar's test, which has a decent false-positive rate).
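
For the computationally cheap option, a sketch of McNemar's test on a single test set; the labels and the two sets of predictions below are simulated, not taken from real models:

```python
# Sketch: McNemar's test for comparing two fitted classifiers on the same test
# set; far cheaper than 5x2cv when refitting a deep model repeatedly is
# impractical. Labels and predictions are simulated placeholders.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
pred_a = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)  # ~85% accurate
pred_b = np.where(rng.random(500) < 0.80, y_true, 1 - y_true)  # ~80% accurate

a_right, b_right = pred_a == y_true, pred_b == y_true
# 2x2 contingency table of where the two models agree/disagree with the truth
table = [[np.sum(a_right & b_right), np.sum(a_right & ~b_right)],
         [np.sum(~a_right & b_right), np.sum(~a_right & ~b_right)]]
result = mcnemar(table, exact=True)  # exact binomial form of the test
print(result.statistic, result.pvalue)
```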

fmaguire commented 5 years ago

This is fairly well covered in https://github.com/Benjamin-Lee/deep-rules/blob/master/content/03.ml-concepts.md:

> Biases in testing data can also unduly influence measures of model performance. For example, many conventional metrics for classification (e.g. area under the receiver operating characteristic curve or AUROC) have limited utility in cases of extreme class imbalance [@pmid:25738806]. As such, model performance should be evaluated with a carefully-picked panel of relevant metrics that make minimal assumptions about the composition of the testing data [@doi:10.1021/acs.molpharmaceut.7b00578].