Pkg.add("ROCAnalysis")
Receiver Operating Characteristic Analysis functions for evaluation probabilistic binary classifiers. Allows efficient plotting of ROC curves, and many more things.
Please note there is an alternative implementation under a similar name, and support for ROC analysis also exists in MLBase.
Our implementation is more geared towards:
The development roadmap is largely based on the functionality in a similar R package ROC.
An annotated jupyter notebook of the code below can be found here.
using ROCAnalysis
## Produce some well-calibrated log-likelihood-ratio scores for target and non-target class:
tar = 2 .+ 2randn(1000)
non = -2 .+ 2randn(100000)
## quick estimate of equal error rate, should be close to pnorm(-1) = 0.5 + 0.5erf(-1/√2)
eer(tar, non)
## compute full ROC statistics
r = roc(tar, non)
## accurate computation of the equal error rate, using the convex hull
eerch(r)
## roc plot, we plot errors (false negatives against false positives) rather than hits vs. false alarms.
using Plots ## or perhaps another plotting package
plot(r)
## The "Detection Error Tradeoff" plot, this should give a more/less straight line
detplot(r)
## compute the Area Under the ROC, should be close to 0.078
auc(r)
## define a decision cost function by its parameter p_tar=0.01, Cfa=1, Cmiss=10 (NIST SRE 2008 setting)
d = DCF(0.01, 1, 10)
## `actual costs' using a threshold of scores at -plo(d) (minus prior log odds)
plo(d)
dcf(tar, non, d=d)
## Or use a different threshold, e.g., zero
dcf(tar, non, d=d, thres=0)
## `minimal costs' using an optimal threshold
mindcf(r, d=d)
## define an array of DCFs, and compute the decision costs for these, using a threshold at -plo
d = DCF([0.001, 0.01, 0.1, 0.5, 0.9, 0.99, 0.999], 1, 1)
dcf(tar, non, d=d)
## same, but normalized to costs of decisions based on the prior alone
dcf(tar, non, d=d, norm=true)
## prior log odds, thre crucial combination of cost parameters, log(p_tar / (1-p_tar) Cmiss / Cfa)
plo(d)
## now scan the Bayes error rate (similar to the dcf above) for a range of prior log odds, and plot
## This is known as the Applied Probability of Error plot
apeplot(r)
## The area under the red curve (actual error rates), the cost of the log-likelihood ratio
cllr(tar, non)
## The area under the green curve (minimum errors), the cost of the optimal log-likelihood-ratio
mincllr(tar, non)
## Similar to APE, but normalized---a Normalized Bayes Error plot
nbeplot(r)
## Make an `LLR' plot: score-to-optimal-LLR mapping, r.θ, vs. r.llr
llrplot(r)
A binary classifier maps an input x
to one of two classes, A
and B
. Internally, every classifier ends up producing some form of a scalar metric s
, which can be thresholded to produce a decision.
There are two possible "senses" of this internal scalar:
s
indicate higher probability of A
s
indicate higher probability of B
There are of course also many different interpretations of the classes A
and B
. For instance, in biometrics B
could mean "same individual" and A
"different individual". The corresponding senses of s
then have an interpretation
Because we want to focus in this package on a probabilistic interpretation of the scalar s
, we take the "score-like" interpretation of s
, i.e., higher values of s
correspond to a higher likelihood of the class-of-interest to be associated to the input of the classifier. If your classifier is, in fact, a distance metric d
, you could work with s = -d
or s = 1/d
or any other strictly decreasing function. Alternatively, you can swap around the label of the class of interest.
As the name suggests, a classifier is supposed to make decisions. Decisions can be thesholded against a fixed θ
such that:
s > θ
, decide class B
s < θ
, decide class A
For evaluating the classifier, we need a set of supervised trials, i.e., for each scalar score s
we need a label indicating the true class of the trial that led to score s
. Because there are two classes, two types of errors can be made:
s > θ
while in fact the true label is A
s < θ
while in fact the true label is B
.The Receiver Operating Characteristic (ROC) is a graph that shows how the fractions of the false positives and false negatives change with varying θ
, for a fixed set of scores s
. In ROCAnalysis
, the structure of type Roc
captures the essential information in a pre-processed way such that other quantities can be derived efficiently.
Because I come from automatic speaker recognition, I tend to use the following terminology for the classes
In this package, we focus at analysing everything in terms of error rates. Traditionally, researchers have used ROC curves where one of the axes is used to describe the complement of an error rate. Specifically, one often sees a true positive rate versus a false positive rate, where the true positive rate is the complement of the false negative rate. There is no real objective way to judge whether one analysis is any better than the other, usually the choice is largely dependent on traditions in the area of the research you are in.
There are many different names of the error rates in different scientific disciplines. Because I come from the area of automatic speaker recognition, the current terminology is
We foresee that the naming of things becomes a bit more flexible in future releases of this package.
A detection error trade-off plot (Martin et. al., 1997) is exactly the same as a ROC plot in the error domain (i.e., miss rate vs false alarm rate), but the axes have been warped according to Φ⁻¹(x)
, the inverse of the cumulative normal distribution. In R, this function is known as qnorm(x)
, in Julia base this is √2 * erfinv(2x - 1)
. This type of plot has interesting properties
σ
parameters of the underlying Normal score distributions, namely that of the non-target scores divided by that of the target scores. Often, highly discriminative classifiers show very flat curves, indicating that that target scores have a much larger variance than the non-target scores. y = x
in a DET plot corresponds linearly to a quantity known as d'
(d-prime) from psychophysics, ranging from 0 at 50% error to about 6 at 0.1% error. There is an essential difference between discrete score (classes) and continuous scores. For the former, trials with the same scores must be grouped before the probabilities of false alarm and miss are computed. This results in ROC and DET plots that can have line elements that are not solely horizontal or vertical. This is contrary to the latter case if we assume that no two scores are (coincidentally) the same, which leads to only horizontal and vertical line segments. This ROCAnalysis
package makes sure that the occurrence of identical scores is treated correctly by sorting target scores before identical non-target scores, and by treating trials with scores at the threshold always as errors.
For larget trial sets, it is very likely that in the extremes of the score distributions there is very little overlap. This will results in many consecutive horizontal or vertical line segments in the plot. This ROCAnalysis
package integrates these consecutive line segments and replaces them by a single segment, which leads to a strong reduction in complexity in further calculations and plotting.
The ROC and DET plots shows the discrimination capability of the detector as a graph. Often one wants to summarize the plot in a single metric. There are many ways to do this, we list some here
y = x
line, i.e., where the error rates are the same. A lower EER means a better discriminating classifier. It samples the ROC on only a single operating point, and moreover, this is an "after-the-fact" point, meaning it can be determined only after all scores for a set of trials have been computed by the classifier. This metric is insensitive to calibration, i.e., any strictly increasing function can be applied to the scores and an identical EER will be computed. For small number of trials it makes a different how to compute the EER, often in literature this is not specified. In this package there are the functions:
eer()
: take the points where the difference between miss and false alarm rates changes sign, and interpolate to find the crossing with the y = x
diagonal.eerch()
: compute the convex hull of the ROC, and compute the point where this crosses the diagonal y = x
. This has the interpretation of the maximum over priors of the minimum cost, and is useful for cost function analysis. It tends to be more stable than eer()
.auc()
integrates the Area Under the Curve of the ROC. Contrary to the EER, this metric is sensitive to the entire range of operating points, but, like the EER, it is insensitive to calibration. It can be interpreted as the probability that a random non-target score is higher than a random target score, and lower AUC indicates better discrimination. Please note that the complement (area under the hit-rate-vs-false-alarm-rate curve) is known under the same name in other disciplines. We believe in errors, and minimizing them.dcf()
A Detection cost function is a weighted sum of false alarms and misses. The weights consists of separate costs for false alarms and misses, and a prior for non-targets and targets.mindcf()
This is the minimum detection cost dcf()
that can be obtained by varying the threshold. It is similar to the EER in the sense that it requires setting a threshold "after the fact". The minimum DCF is related to the points on the convex hull of the ROC curve. cllr()
computes a normalized form of the cross-entropy between the "true posterior" (1
for target trials and 0
for non-target trials) and the posterior resulting from the classifier score when interpreted as a likelihood ratio, and using a prior for the classes of 0.5. This measure is calibration sensitive, i.e., it penalizes under- or over-confident likelihood ratios. The minimum value is determined by the discriminative properties of the classifier, and this minimum approaches 0 for a classifier that completely separates the two classes. A value of 1 indicates that the classifier gives no information, i.e., decisions can just as well be made based on the prior only. A value larger than 1 indicates that making Bayes's decisions based on the classifiers score gives a higher expected cost than basing decisions on the prior alone.mincllr()
computes the minimum attainable Cllr by warping the scores to log-likelihood-ratios while maintaining the order of the scores. This is equivalent to determining a minimum cost for all cost functions that can be written as a linear combination of actual miss- and false-alarm-rates, and integrating these costs over cost function parameters.We have defined the following types:
TNT(tar, non)
, A container for arrays of target and non-target scores. Anywhere it says tar, non
you should be able to substitute a variable of type TNT
.DCF(ptar, cfa, cmiss)
. A container for scalars or arrays of the cost function parameters ptar
(the prior probability of a target), cfa
(the cost of a false alarm) and cmiss
(the cost of a miss). In any of the parameters are arrays, they must be of compatible size.Roc(pfa, pmiss, θ, chull, llr)
. The basic structure for storing ROC results. It contains the x- and y-coordinates of the (error-based) ROC, together with the threshold. Many of the "minimum" performance measures can be computed from this. The fields in Roc
have the following interpretation:
pfa
. The probability of a false alarm, the fraction of non-target scores above (or equal to) the threshold θ
. This quantity is also known as false alarm/accept/positive/match rate or type II error, and 1-pfa
is known as the true negative/reject rate or specificity.pmiss
. The probability of a miss, the fraction of target scores below (or equal to) the threshold θ
. This quantity is also known, a.o. miss rate, false negative/reject/non-match rate, type I error, and 1-pmiss
is also known as true positive rate, sensitivity, recall, verification/hit rate, genuine acceptance rate.θ
. The threshold associated with the elements in pfa
and pmiss
. It can be seen as lying between this pfa/miss
and the next. Hence, there is one value less than in the pmiss
and pfa
arrays.chull
. A boolean indicating whether or not the point (pfa, pmiss)
lies on the convex hull.llr
. The optimal log-likelihood-ratio associated with scores at θ
. This is very much work in progress. If you stumble upon this, please drop me a line.