`aoa()` appears to return incorrect thresholds (different from Meyer & Pebesma 2021)

mikemahoney218 commented 1 year ago

Hi all,

Adapting some code from the MEE-AOA repo, I believe I can calculate an AOA like this:

set.seed(123)

library(CAST)
library(caret)
library(virtualspecies)

npoints <- 50
meansPCA <- c(3, -1)
sdPCA <- c(2, 2)
simulateResponse <- c("bio2","bio5","bio10", "bio13", "bio14","bio19")
studyarea <- c(-15, 65, 30, 75)
predictors_global <- raster::brick(
  system.file(
    "extdata/bioclim_global.grd", 
    package = "CAST"
  )
)

predictors <- crop(predictors_global, extent(studyarea))
mask <- predictors[[1]]
values(mask)[!is.na(values(mask))] <- 1
response_vs <- generateSpFromPCA(
  predictors[[simulateResponse]],
  means = meansPCA,
  sds = sdPCA, 
  plot = FALSE
)
response <- response_vs$suitab.raster
mask <- rasterToPolygons(mask,dissolve=TRUE)

samplepoints <- spsample(mask,npoints,"random")
trainDat <- extract(predictors,samplepoints,df=TRUE)
trainDat$response <- extract (response,samplepoints)
trainDat <- trainDat[complete.cases(trainDat),]

model <- train(trainDat[,names(predictors)],
               trainDat$response,
               method="rf",
               importance=TRUE,
               trControl = trainControl(method="none"))

AOA <- aoa(trainDat, model=model)

According to the 2021 paper, I believe the AOA threshold after this should be equal to "the 75-percentile plus 1.5 times the IQR of the DI values of the cross-validated training data". Calculating that using quantile and IQR gives us these results:

di <- attr(AOA$AOA, "TrainDI")

(threshold_quantile <- stats::quantile(di, 0.75))
#>       75% 
#> 0.3059488
(threshold_iqr <- (1.5 * stats::IQR(di)))
#> [1] 0.3392091
threshold_quantile + threshold_iqr
#>       75% 
#> 0.6451579

But the AOA threshold returned by aoa() doesn't match that calculation:

AOA$parameters$threshold
#> [1] 0.4770295

If I'm right and this is unexpected, it seems to be due to the use of boxplot.stats() here: https://github.com/HannaMeyer/CAST/blob/afcba3f14426c0d92212a4eeb5e2c4e39870c542/R/trainDI.R#L221

That gives us the threshold that CAST returns:

grDevices::boxplot.stats(di)$stats[5]
#> [1] 0.4770295

But I'm not entirely sure what boxplot.stats() actually does. For instance, imagine that we cut off the last di value in our vector:

di[50]
#> [1] 0.2120274
di <- di[1:49]

Because it's a rather low number, both our 75% percentile and IQR increase:

(threshold_quantile <- stats::quantile(di, 0.75))
#>       75% 
#> 0.3101567
(threshold_iqr <- (1.5 * stats::IQR(di)))
#> [1] 0.3523555
threshold_quantile + threshold_iqr
#>       75% 
#> 0.6625121

But boxplot.stats() returns the same value as before:

grDevices::boxplot.stats(di)$stats[5]
#> [1] 0.4770295

^{Created on 2022-12-11 by the reprex package (v2.0.1)}

Apologies if I'm misunderstanding something here! The return here just didn't match my expectations.

mikemahoney218 commented 1 year ago

Looking at the source of grDevices::boxplot.stats, I see:

> grDevices::boxplot.stats
function (x, coef = 1.5, do.conf = TRUE, do.out = TRUE) 
{
    if (coef < 0) 
        stop("'coef' must not be negative")
    nna <- !is.na(x)
    n <- sum(nna)
    stats <- stats::fivenum(x, na.rm = TRUE)
    iqr <- diff(stats[c(2, 4)])
    if (coef == 0) 
        do.out <- FALSE
    else {
        out <- if (!is.na(iqr)) {
            x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * 
                iqr)
        }
        else !is.finite(x)
        if (any(out[nna], na.rm = TRUE)) 
            stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
    }
    conf <- if (do.conf) 
        stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
    list(stats = stats, n = n, conf = conf, out = if (do.out) x[out & 
        nna] else numeric())
}

The relevant lines here are:

out <- if (!is.na(iqr)) {
  x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
}
#> [...]
stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)

In context, that means that the AOA threshold winds up equaling:

max(di[!(di > (quantile(di, 0.75) + 1.5 * IQR(di)))])

Which means that the threshold is going to be the value in di closest to, but not more than quantile(di, 0.75) + (1.5 * IQR(di)), which, particularly for smaller data, may be a significantly different value. The returned value will always be lower than (or the same as) the 75th percentile plus 1.5 times the IQR. Is this expected behavior?

HannaMeyer commented 1 year ago

Thanks for finding that! I fixed it according to your suggestion.

HannaMeyer / CAST

`aoa()` appears to return incorrect thresholds (different from Meyer & Pebesma 2021) #46