Closed mikemahoney218 closed 1 year ago
Looking at the source of grDevices::boxplot.stats, I see:
> grDevices::boxplot.stats
function (x, coef = 1.5, do.conf = TRUE, do.out = TRUE)
{
    if (coef < 0)
        stop("'coef' must not be negative")
    nna <- !is.na(x)
    n <- sum(nna)
    stats <- stats::fivenum(x, na.rm = TRUE)
    iqr <- diff(stats[c(2, 4)])
    if (coef == 0)
        do.out <- FALSE
    else {
        out <- if (!is.na(iqr)) {
            x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef *
                iqr)
        }
        else !is.finite(x)
        if (any(out[nna], na.rm = TRUE))
            stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
    }
    conf <- if (do.conf)
        stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
    list(stats = stats, n = n, conf = conf, out = if (do.out) x[out &
        nna] else numeric())
}
The relevant lines here are:
    out <- if (!is.na(iqr)) {
        x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
    }
    #> [...]
    stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
In context, that means that the AOA threshold winds up equaling:
max(di[!(di > (quantile(di, 0.75) + 1.5 * IQR(di)))])
This means the threshold is going to be the largest value in di that does not exceed quantile(di, 0.75) + (1.5 * IQR(di)), which, particularly for smaller data sets, may be a significantly different value. The returned value will always be lower than (or the same as) the 75th percentile plus 1.5 times the IQR. Is this expected behavior?
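To make the difference concrete, here is a minimal sketch using a made-up di vector (the values and variable names are hypothetical, not taken from CAST or the reprex). It compares the upper whisker reported by boxplot.stats() with the "75th percentile plus 1.5 times the IQR" value. Note that boxplot.stats() computes its quartiles as hinges via fivenum(), which can differ slightly from the default quantile() method, so the max(...) expression above is an approximation of what it does.

```r
# Hypothetical DI values with one clear outlier (illustration only)
di <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 5)

# Threshold as described in the paper: Q3 + 1.5 * IQR
paper_threshold <- quantile(di, 0.75, names = FALSE) + 1.5 * IQR(di)

# Upper whisker from boxplot.stats(): the largest observed value
# that is not flagged as an outlier
whisker <- boxplot.stats(di)$stats[5]

# Equivalent "largest value not above the fence" calculation
largest_below <- max(di[di <= paper_threshold])

print(c(paper_threshold = paper_threshold,
        whisker = whisker,
        largest_below = largest_below))
```

With this vector the whisker falls on an observed value well below the fence, because there is a wide gap between the largest non-outlying value and the outlier.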
Thanks for finding that! I fixed it according to your suggestion.
Hi all,
Adapting some code from the MEE-AOA repo, I believe I can calculate an AOA like this:
According to the 2021 paper, I believe the AOA threshold after this should be equal to "the 75-percentile plus 1.5 times the IQR of the DI values of the cross-validated training data". Calculating that using quantile and IQR gives us these results:
But the AOA threshold returned by aoa() doesn't match that calculation:
If I'm right and this is unexpected, it seems to be due to the use of boxplot.stats() here: https://github.com/HannaMeyer/CAST/blob/afcba3f14426c0d92212a4eeb5e2c4e39870c542/R/trainDI.R#L221
That gives us the threshold that CAST returns:
But I'm not entirely sure what boxplot.stats() actually does. For instance, imagine that we cut off the last di value in our vector:
Because it's a rather low number, both our 75th percentile and IQR increase:
But boxplot.stats() returns the same value as before:
Created on 2022-12-11 by the reprex package (v2.0.1)
Apologies if I'm misunderstanding something here! The return here just didn't match my expectations.
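Since the reprex output is not reproduced here, the following is a minimal sketch of the same effect with a made-up vector (hypothetical values, not the DI values from the original example). Dropping the low last value raises both the 75th percentile and the IQR, so the quantile-based threshold moves, while the boxplot.stats() whisker stays on the same observed value.

```r
# Hypothetical DI values; the last element is deliberately low
di <- c(0.11, 0.12, 0.13, 0.3, 0.5, 0.9, 5, 0.10)
di_trim <- di[-length(di)]  # cut off the last value

# Q3 + 1.5 * IQR, as described in the paper
quantile_threshold <- function(x) {
  quantile(x, 0.75, names = FALSE) + 1.5 * IQR(x)
}

# Upper whisker as reported by boxplot.stats()
upper_whisker <- function(x) boxplot.stats(x)$stats[5]

thr_full <- quantile_threshold(di)
thr_trim <- quantile_threshold(di_trim)
w_full <- upper_whisker(di)
w_trim <- upper_whisker(di_trim)

# The quantile-based threshold changes; the whisker does not
print(c(thr_full = thr_full, thr_trim = thr_trim))
print(c(w_full = w_full, w_trim = w_trim))
```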