easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0

`check_outliers`: Clarification about thresholds and outlier scores #467

Closed rempsyc closed 2 years ago

rempsyc commented 2 years ago

This is low priority, but I'm just putting it out there to free up some space in PR #443.


There are essentially four questions relating to thresholds in check_outliers and outlier scores:

devtools::load_all()
#> ℹ Loading performance

# IQR
fun.iqr <- function(x) {
  sapply(x, function(y) {
    y <- as.data.frame(y)
    df <- .check_outliers_iqr(y)
    unique(df$data_iqr$Distance_IQR)
  }) |> unlist()
}

fun.iqr(mtcars)
#>  mpg1  mpg2   cyl  disp   hp1   hp2  drat   wt1   wt2 qsec1 qsec2    vs    am 
#>     0     1     0     0     0     1     0     0     1     0     1     0     0 
#>  gear carb1 carb2 
#>     0     0     1
fun.iqr(iris[-5])
#> Sepal.Length Sepal.Width1 Sepal.Width2 Petal.Length  Petal.Width 
#>            0            0            1            0            0
fun.iqr(airquality)
#>   Ozone1   Ozone2   Ozone3 Solar.R1 Solar.R2    Wind1    Wind2     Temp 
#>        0       NA        1        0       NA        0        1        0 
#>    Month      Day 
#>        0        0

# CI
fun.ci <- function(x) {
  sapply(x, function(y) {
    y <- as.data.frame(y)
    df <- .check_outliers_ci(y, method = "ci")
    unique(df$data_ci$Distance_ci)
  }) |> unlist()
}

fun.ci(mtcars)
#>  mpg1  mpg2   cyl disp1 disp2   hp1   hp2 drat1 drat2   wt1   wt2 qsec1 qsec2 
#>     0     1     0     0     1     0     1     0     1     0     1     0     1 
#>    vs    am  gear carb1 carb2 
#>     0     0     0     0     1
fun.ci(iris[-5])
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           0            0           0
#> [2,]            1           1            1           1
fun.ci(airquality)
#>   Ozone1   Ozone2   Ozone3 Solar.R1 Solar.R2 Solar.R3    Wind1    Wind2 
#>        0      NaN        1        0      NaN        1        0        1 
#>    Temp1    Temp2    Month     Day1     Day2 
#>        0        1        0        0        1

# zscore can differ from 0 and 1
fun.zscore <- function(x) {
  sapply(x, function(y) {
    y <- as.data.frame(y)
    df <- .check_outliers_zscore(y)
    unique(df$data_zscore$Distance_Zscore)
  }) |> unlist()
}

fun.zscore(mtcars) |> head()
#>     mpg1     mpg2     mpg3     mpg4     mpg5     mpg6 
#> 1.000000 1.626170 2.439254 2.069670 2.716442 1.090273
fun.zscore(iris[-5]) |> head()
#> Sepal.Length1 Sepal.Length2 Sepal.Length3 Sepal.Length4 Sepal.Length5 
#>      1.000000      1.059914      1.156270      1.348982      1.445337 
#> Sepal.Length6 
#>      1.252626
fun.zscore(airquality) |> head()
#>   Ozone1   Ozone2   Ozone3   Ozone4   Ozone5   Ozone6 
#> 1.000000 1.175541 1.059914 3.218284 1.522422 3.989131

Created on 2022-08-16 by the reprex package (v2.0.1)

rempsyc commented 2 years ago

@bwiernik I'm planning on submitting #443 before this is resolved, but I'm pinging you in case you have opinions on this that I could integrate before the merge.

bwiernik commented 2 years ago

Which do you want me to look at?

rempsyc commented 2 years ago

All four questions above. But we can start small, with no. 1.

(You can skip the reprex; it was only a demonstration of the point.)

bwiernik commented 2 years ago

Edit: I misunderstood how IQR works. I fixed the comparison of it below.

  1. 3 SDs seems appropriate to me; 1.96 will flag way too many. The Tukey 1.5 IQR threshold used for boxplots is equivalent to 2.7 SDs, so it flags fewer cases than the 1.96 cutoff for the z-score and confidence-interval methods, but more than 3 SDs would. I think ensuring similarity in thresholds across methods is important, so I suggest hewing closer to 3 SDs: that's 3 for zscore, 1.7 IQR for IQR, .001 for the p-value-like ones, and .999 for the CI-like ones. For Cook's D, the threshold is the median of the implied F distribution, which is a good heuristic that doesn't flag too many influential cases; Pareto is similar. The OPTICS method I don't really know anything about.

So that's:


list(
  zscore = stats::qnorm(p = 1 - 0.001),
  iqr = 1.7,
  ci = 0.999,
  cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
  pareto = 0.7,
  mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  robust = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  ics = 0.001,
  optics = 2 * ncol(x),
  iforest = 0.001,
  lof = 0.001
)

What do folks think?

  2. Yes please

3 and 4. That seems like a bug. I think it might be returning a logical of whether the points are outside the threshold, which then gets coerced to numbers?
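As a quick sanity check on the proposed cutoffs (just a base R sketch, not package code), p = .001 does line up with the 3 SD rule:

```r
qnorm(p = 1 - 0.001)                 # ~3.09: the proposed zscore threshold, in SDs
sqrt(qchisq(p = 1 - 0.001, df = 1))  # the same cutoff via the chi-square quantile
```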

rempsyc commented 2 years ago

Thanks. For reference, here are the other thresholds in case you'd like me to change any other one:

list(
  zscore = stats::qnorm(p = 1 - 0.025),
  zscore_robust = stats::qnorm(p = 1 - 0.025),
  iqr = 1.5,
  ci = 0.95,
  eti = 0.95,
  hdi = 0.95,
  bci = 0.95,
  cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
  pareto = 0.7,
  mahalanobis = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
  mahalanobis_robust = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
  mcd = stats::qchisq(p = 1 - 0.025, df = ncol(x)),
  ics = 0.025,
  optics = 2 * ncol(x),
  iforest = 0.025,
  lof = 0.025
)
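For comparison, the current p = .025 defaults correspond to roughly 1.96 SDs, the cutoff flagged above as too liberal (again, just a base R check):

```r
qnorm(p = 1 - 0.025)                 # ~1.96 SDs: the current zscore default
sqrt(qchisq(p = 1 - 0.025, df = 1))  # same cutoff expressed via chi-square
```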

rempsyc commented 2 years ago

Ok, I have resolved no. 3: the threshold is not applied to the distance score as in the other methods, but at an earlier step, to calculate the distance score itself (which therefore ends up 0 or 1). So it is working as intended and not a problem.

That only leaves no. 4.

rempsyc commented 2 years ago

About no. 4: I think it is not a bug. I will make a reprex shortly to show that it is calculated like this on purpose. However, it does raise the question of whether this is desirable behaviour and what could be changed.

bwiernik commented 2 years ago

I edited my comment above at the same time you were responding, so I've made some suggested threshold changes there.

bwiernik commented 2 years ago

I'm not really understanding what computation is happening in 4. Can you walk me through it?

rempsyc commented 2 years ago

Here is how the IQR distances are calculated. Is this a bug or intentional, and should we change it?

x <- mtcars
threshold <- 1.5
method <- "tukey"
d <- data.frame(Row = seq_len(nrow(as.data.frame(x))))

# This is the baddie function!
for (col in seq_len(ncol(as.data.frame(x)))) {
  v <- x[, col]
  if (method == "tukey") {
    iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE)
  } else {
    iqr <- stats::IQR(v, na.rm = TRUE)
  }
  lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold)
  upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold)
  d[names(as.data.frame(x))[col]] <- ifelse(v > upper, 1, ifelse(v < lower, 1, 0))
}

head(d)
#>   Row mpg cyl disp hp drat wt qsec vs am gear carb
#> 1   1   0   0    0  0    0  0    0  0  0    0    0
#> 2   2   0   0    0  0    0  0    0  0  0    0    0
#> 3   3   0   0    0  0    0  0    0  0  0    0    0
#> 4   4   0   0    0  0    0  0    0  0  0    0    0
#> 5   5   0   0    0  0    0  0    0  0  0    0    0
#> 6   6   0   0    0  0    0  0    0  0  0    0    0

So if we decompose this to understand where the 0s and 1s come from:

col
#> [1] 11
(v <- x[, col])
#>  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
(iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE))
#> 75% 
#>   2
(lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold))
#> 25% 
#>  -1
(upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold))
#> 75% 
#>   7

# This is the culprit: an ifelse with 1 or 0 as outcome!
ifelse(v > upper, 1, ifelse(v < lower, 1, 0))
#>  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

Created on 2022-08-24 by the reprex package (v2.0.1)

Same ifelse logic for ci methods.
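In other words (a minimal check, outside the package internals), the nested `ifelse()` is just a logical comparison coerced to 0/1, as guessed above:

```r
v <- mtcars$carb
lower <- -1  # bounds from the carb decomposition above
upper <- 7

nested  <- ifelse(v > upper, 1, ifelse(v < lower, 1, 0))
coerced <- as.numeric(v < lower | v > upper)  # logical coerced to 0/1

identical(nested, coerced)  # TRUE
```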

bwiernik commented 2 years ago

Oh, I was misunderstanding the computation/threshold for the IQR. The existing threshold of 1.5 is equivalent to 2.7 SDs. 1.7 would be equivalent to 3 SDs. I edited the post above.
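The 1.5 → 2.7 SD and 1.7 → 3 SD equivalences can be verified in base R, assuming a normal distribution (the `fence_in_sds` helper is illustrative, not part of the package):

```r
# For a normal variable, Q3 - Q1 = 2 * qnorm(0.75) SDs (about 1.35), so the
# upper Tukey fence Q3 + k * IQR sits at qnorm(0.75) + k * 2 * qnorm(0.75) SDs.
fence_in_sds <- function(k) qnorm(0.75) + k * 2 * qnorm(0.75)

fence_in_sds(1.5)  # ~2.70 SDs: the classic boxplot rule
fence_in_sds(1.7)  # ~2.97 SDs: close to the 3 SD rule
```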

rempsyc commented 2 years ago

Would it make sense to report the actual percentile as the IQR distance score, regardless of whether method = "tukey" or not, since (in my understanding) only the cut-off values would change? And would dividing the rank by the total number of observations be the right formula? E.g.,

(v <- mtcars$carb)
#>  [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
threshold <- 0.2

Distance_IQR <- rank(v, na.last = "keep")/length(v)

And then we can still identify outliers with the upper and lower boundaries as calculated before:

(iqr <- stats::quantile(v, 0.75, na.rm = TRUE) - stats::quantile(v, 0.25, na.rm = TRUE))
#> 75% 
#>   2
(lower <- stats::quantile(v, 0.25, na.rm = TRUE) - (iqr * threshold))
#> 25% 
#> 1.6
(upper <- stats::quantile(v, 0.75, na.rm = TRUE) + (iqr * threshold))
#> 75% 
#> 4.4

Outlier_IQR <- (ifelse(v > upper, 1, ifelse(v < lower, 1, 0)))

out <- data.frame(Distance_IQR, Outlier_IQR)
out
#>    Distance_IQR Outlier_IQR
#> 1      0.796875           0
#> 2      0.796875           0
#> 3      0.125000           1
#> 4      0.125000           1
#> 5      0.390625           0
#> 6      0.125000           1
#> 7      0.796875           0
#> 8      0.390625           0
#> 9      0.390625           0
#> 10     0.796875           0
#> 11     0.796875           0
#> 12     0.593750           0
#> 13     0.593750           0
#> 14     0.593750           0
#> 15     0.796875           0
#> 16     0.796875           0
#> 17     0.796875           0
#> 18     0.125000           1
#> 19     0.390625           0
#> 20     0.125000           1
#> 21     0.125000           1
#> 22     0.390625           0
#> 23     0.390625           0
#> 24     0.796875           0
#> 25     0.390625           0
#> 26     0.125000           1
#> 27     0.390625           0
#> 28     0.390625           0
#> 29     0.796875           0
#> 30     0.968750           1
#> 31     1.000000           1
#> 32     0.390625           0

Then, for the composite score across all columns, we could take the row-wise max across columns (like zscore, etc.), rather than the average.

Created on 2022-08-25 by the reprex package (v2.0.1)
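A compact sketch of that proposal (the `pct_distance` helper is hypothetical; rank divided by n as in the reprex above, then the row-wise max for the composite):

```r
pct_distance <- function(v) rank(v, na.last = "keep") / length(v)

d <- sapply(mtcars, pct_distance)    # one percentile-distance column per variable
composite <- apply(d, 1, max, na.rm = TRUE)

range(composite)                     # distances stay within (0, 1]
```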

strengejacke commented 2 years ago

@rempsyc Could you please add a short text to the NEWS file about the changes? I'm not sure I can appropriately summarise your whole work.

bwiernik commented 2 years ago

I would prefer to use non-thresholded stats for everything, and only use the thresholds when drawing threshold lines on plots or when flagging cases.

DominiqueMakowski commented 2 years ago

> or when flagging cases

Yup, that's the purpose of check_outliers: to identify possible outliers.

rempsyc commented 2 years ago

@bwiernik Would it make statistical sense to report the actual percentile as the IQR distance score? What about the CI distance score?

bwiernik commented 2 years ago

Yes for both of those

rempsyc commented 2 years ago

Ok, so if everyone is alright with that, I'm going to submit a PR with all our agreed changes, including changing the distance scores to percentiles for IQR and all CI methods.