Investigating the high % of outliers detected with the MCD method

rempsyc commented 9 months ago

We have discussed how the MCD method's default threshold seems to provide high percentages of outlier detections in real-life datasets, and also as exemplified in the mtcars dataset. The reprex below aims to investigate this issue a bit more systematically.

Code for compare_methods() function

``` r compare_methods <- function(data) { data <- datawizard::data_select(data, select = is.numeric) data[is.na(data)] <- 0 df <- data.frame(method = c( "mahalanobis", "mahalanobis_robust", "mcd", "leys_mcd")) x <- lapply(df$method[1:3], \(x) { performance::check_outliers(data, method = x) |> which() |> length() }) |> unlist() leys_mcd <- Routliers::outliers_mcd(data)$nb outliers <- c(x, leys_mcd) data.frame(df, outliers, percent = round(outliers / nrow(data), 2)) } ```

list.df <- tibble::lst(mtcars, iris, ToothGrowth, airquality)
lapply(list.df, compare_methods)
#> $mtcars
#>               method outliers percent
#> 1        mahalanobis        0    0.00
#> 2 mahalanobis_robust        8    0.25
#> 3                mcd       10    0.31
#> 4           leys_mcd       15    0.47
#> 
#> $iris
#>               method outliers percent
#> 1        mahalanobis        0    0.00
#> 2 mahalanobis_robust        4    0.03
#> 3                mcd        3    0.02
#> 4           leys_mcd       53    0.35
#> 
#> $ToothGrowth
#>               method outliers percent
#> 1        mahalanobis        0     0.0
#> 2 mahalanobis_robust        0     0.0
#> 3                mcd        6     0.1
#> 4           leys_mcd        0     0.0
#> 
#> $airquality
#>               method outliers percent
#> 1        mahalanobis        0    0.00
#> 2 mahalanobis_robust        1    0.01
#> 3                mcd        6    0.04
#> 4           leys_mcd       27    0.18

^{Created on 2024-01-24 with reprex v2.0.2}

We can see that indeed, from the mtcars dataset, our default threshold MCD method flags a high percentage of outliers (31%), but we can also see that Ley's MCD function's (Routliers::outliers_mcd()) default threshold (which is more conservative) tags a whopping 47% outliers, almost half of the sample!

However, this could be an artifact from the mtcars dataset, since the percentages are much lower for iris, ToothGrowth, airquality, ranging from 1% (our MCD) to 35% (Ley's MCD).

Next, I will try to get the exact percentages I was getting in my own real-life datasets later.

rempsyc commented 9 months ago

Ok, so here's the result from my real-life data I'm actually trying to publish right now. My sample size is 203 at this point in the script, and this dataframe contains only 33 dependent variables of interest actually used in the analyses (only scale averages when applicable, so not individual items):

> compare_methods(data.na[-c(15:16)])
              method outliers percent
1        mahalanobis        4    0.02
2 mahalanobis_robust       52    0.26
3                mcd       60    0.30
4           leys_mcd       86    0.42

(The full code is available here: https://remi-theriault.com/scripts/varela#Assumptions, “Multivariate outliers” sub-tab)

So the methods detect a whopping 42% of my sample as outliers for Ley's MCD, and 30% for our MCD, compared to only 2% for the regular Mahalanobis distance! Of course, throwing away 30-40% of my data isn't practical for publication purposes...

One colleague who read our outlier paper tried our MCD method on his data and was getting similar results. Perhaps @CharlesEtienneLavoie can share the percentage of outliers detected from his sample so we can have another real-life point of comparison.

rempsyc commented 9 months ago

?Routliers::outliers_mcd provides this useful bit of information:

h   proportion of dataset to use in order to compute sample means and covariances
# The default is to use 75% of the datasets in order to compute sample means and covariances
# This proportion equals 1-breakdown points (i.e. h = .75 <--> breakdown points = .25)
# This breakdown points is encouraged by Leys et al. (2018)

Leys et al. (2018) write:

Secondly, Table 1 provides estimations of the correlations (and SD) using Mahalanobis distance, MCD50 (using a sub-sample of h = n/2, hence a breakdown point of 0.5), MCD75 (using a sub-sample of h = 3n/4, hence a breakdown point of 0.25) methods to remove outliers as well as the true and false detection rates. It shows that MCD75 always yields the best estimations. It has the most efficient detection of outlying values as well as an acceptable false detection rate. Indeed, the MCD50 bases its estimates on a smaller sub-sample than MCD75 and, therefore, is less reliable. The basic Mahalanobis distance is seriously disturbed by outliers, hence not reliable at all. Note that the smaller the correlation, the harder it is to detect outliers, although MCD75 remains the best choice (a correlation of .1 implies that Mahalanobis method will not detect any outliers even in large samples). Of course, the superiority of MCD75 holds as long as there are less than 25% outliers (which is true of most social psychological research). We ran a simulation on 500 observations with 30% outliers. In this situation MCD50 becomes the best indicator (the basic Mahalanobis and MCD75 become totally unreliable).

Perhaps it has to do with this. Routliers::outliers_mcd permits changing this number with the h argument. This does not appear documented, but from the code we can see check_outliers has a similar percentage_central = 0.50 argument. So we use .50 while Leys recommend .75, except in cases of > 30% outliers. Perhaps @bwiernik can provide some recommandations or impressions here.

mattansb commented 9 months ago

Doesn't the robust Mahalanobis also seem rather high?

In well behaved data, I would expect 0-3% false positives, regardless of method.

rempsyc commented 9 months ago

Absolutely. It is generally a bit better than the other two, but they are all quite close.

rempsyc commented 9 months ago

Just an idea - perhaps this could be a discussion point of the paper, another limitation of the MCD method, that in certain cases it detects up to almost half the sample as outliers, thus it provides more support in favour of model-based methods...

rempsyc commented 9 months ago

I'm also not able to make our MCD method match Ley's MCD method easily, using the same alpha value and regardless of the percentage_central argument.

alpha nominal type I error probability (by default .01)

alpha <- 0.1
set.seed(42)
Routliers::outliers_mcd(mtcars, alpha = alpha)$nb
#> total 
#>    15

mcd_threshold = stats::qchisq(p = 1 - 0.001, df = ncol(x)),

library(performance)
set.seed(42)
check_outliers(mtcars, method = "mcd", 
               percentage_central = 50,
               threshold = stats::qchisq(p = 1 - alpha, df = ncol(mtcars))) |>
  which() |>
  length()
#> [1] 11

set.seed(42)
check_outliers(mtcars, method = "mcd", 
               percentage_central = 25,
               threshold = stats::qchisq(p = 1 - alpha, df = ncol(mtcars))) |>
  which() |>
  length()
#> [1] 11

set.seed(42)
check_outliers(mtcars, method = "mcd", 
               percentage_central = 75,
               threshold = stats::qchisq(p = 1 - alpha, df = ncol(mtcars))) |>
  which() |>
  length()
#> [1] 11

^{Created on 2024-01-24 with reprex v2.0.2}

rempsyc commented 9 months ago

Think I found part of the issue, percentage_central defaults to 50 in .check_outliers_mcd():

https://github.com/easystats/performance/blob/eac5e9824cf64e372a46a8d3eb17947329e9cb84/R/check_outliers.R#L1727-L1730

but is actually hard-coded at 66 earlier.....

https://github.com/easystats/performance/blob/eac5e9824cf64e372a46a8d3eb17947329e9cb84/R/check_outliers.R#L1096-L1103

rempsyc commented 9 months ago

This is how to make them match (after correcting percentage_central):

alpha <- 0.1
h <- .50
Routliers::outliers_mcd(mtcars, alpha = alpha, h = h)$nb
#> total 
#>    15

performance:::.check_outliers_mcd( 
  mtcars, 
  threshold = stats::qchisq(p = 1 - alpha, df = ncol(mtcars)), 
  percentage_central = h)$data_mcd$Outlier_MCD |> 
  sum()
#> [1] 15

^{Created on 2024-01-24 with reprex v2.0.2}

Funny thing is that although Leys recommend .75, and claim that this is their default, and it is here:

https://github.com/mdelacre/Routliers/blob/e4dac8814ea53db87083ef59d468c05dc6cedeea/R/outliers_mcd.R#L39-L43

They also overwrite it later in the code:

https://github.com/mdelacre/Routliers/blob/e4dac8814ea53db87083ef59d468c05dc6cedeea/R/outliers_mcd.R#L89

mattansb commented 9 months ago

Might be worth opening an issue over by them?

bwiernik commented 9 months ago

The first thing I would check is if the output for robust and MCD are actually in the chi square metric and not already processed into something like a proportion or p value. I think robust might be pre-processed based on a quick glance at the code, but didn't get a chance to dig deep yesterday.

For MCD, if it is a chi square, the df may be wrong given that we are using the full sample for the df but only 66% for the centroid?

rempsyc commented 9 months ago

This is how the MCD distance scores are calculated:

x <- mtcars
out <- data.frame(Row = seq_len(nrow(x)))
percentage_central = .66

threshold <- stats::qchisq(p = 1 - 0.001, df = ncol(x))
threshold
#> [1] 31.26413

mcd <- MASS::cov.mcd(x, quantile.used = percentage_central * nrow(x))
out$Distance_MCD <- stats::mahalanobis(x, center = mcd$center, cov = mcd$cov)
out$Outlier_MCD <- as.numeric(out$Distance_MCD > threshold)
tail(out)
#>    Row Distance_MCD Outlier_MCD
#> 27  27    180.71323           1
#> 28  28    441.60787           1
#> 29  29    976.82407           1
#> 30  30    447.51966           1
#> 31  31   1651.65347           1
#> 32  32     10.65753           0

This is how the robust Mahalanobis distance scores are calculated:

out <- data.frame(Row = seq_len(nrow(x)))
U <- svd(scale(x))$u
out$Distance_Mahalanobis_robust <- bigutilsr::dist_ogk(U)
out$Outlier_Mahalanobis_robust <- as.numeric(
  out$Distance_Mahalanobis_robust > threshold
)
tail(out)
#>    Row Distance_Mahalanobis_robust Outlier_Mahalanobis_robust
#> 27  27                   105.41747                          1
#> 28  28                    44.20559                          1
#> 29  29                   260.12084                          1
#> 30  30                    20.46648                          0
#> 31  31                   139.65254                          1
#> 32  32                    10.51291                          0

Given that the ouput is in the chi-square metric for both robust and MCD, how would you suggest changing the df? Note that Leys also uses the df of the full sample like we currently do.

^{Created on 2024-01-25 with reprex v2.0.2}

rempsyc commented 8 months ago

(Based on Mattan's suggestion:) In well-behaved data with few variables (a dozen or so), the MCD method seems fine, with few outliers detected (1%). However, with more variables (as in real datasets), the outlier overdetection issue becomes a lot more obvious, a lot more quickly (25% with 21 normally-distributed variables).

library(performance)

set.seed(42)
df <- rnorm(n = 100, mean = 100, sd = 10) |>
  replicate(n = 25) |>
  as.data.frame()

rempsyc::nice_normality(df, "V1")


set.seed(42)
check_outliers(df[1:12], method = "mcd")
#> 1 outlier detected: case 59.
#> - Based on the following method and threshold: mcd (32.909).
#> - For variables: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12.

set.seed(42)
check_outliers(df[1:21], method = "mcd")
#> 25 outliers detected: cases 7, 8, 9, 10, 11, 12, 18, 25, 28, 29, 34, 36,
#>   37, 40, 44, 47, 55, 59, 61, 65, 67, 69, 82, 83, 90.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13,
#>   V14, V15, V16, V17, V18, V19, V20, V21.

^{Created on 2024-02-02 with reprex v2.0.2}

From this, we can conclude that the issue does not seem to stem from non-normally distributed variables, but rather from the number of variables.

mattansb commented 8 months ago

This might be related to the curse of dimensionality? As the number of dimensions increases, the typical distance from the centroid is larger. Regardless, this should be resolved... Maybe it's time to ping the authors of the paper?

mattansb commented 8 months ago

This does seem to be a function of the N/p (N = sample size; p = number of parameters) ratio. When it is larger than 10, the % of outliers flagged is okay (in well behaved data). This makes sense: the MCD looks at the cov matrix of subsamples of the data - with high dimensional data, small samples sizes will give highly variable cov matrices, as so the "smallest" one will probably miss-represent the data.

The solution might be to set the sub-sample size to be large enough, as a function of N / p.

n_outliers_MCD <- function(N, p) {
  data <- data.frame(MASS::mvrnorm(N, rep(0, p), diag(p)))

  results <- performance::check_outliers(data, method = "mcd")
  sum(results)
}

grid <- expand.grid(
  N = c(50, 100, 150, 200, 300, 500),
  p = c(5, 10, 15, 20, 25),
  rep = 1:5
)

grid$n_outliers <- mapply(n_outliers_MCD, N = grid$N, p = grid$p)

library(ggplot2)
library(scales)

ggplot(grid, aes(N, n_outliers/N, fill = ordered(p))) + 
  geom_smooth(aes(color = ordered(p)), se = FALSE) + 
  geom_boxplot(aes(group = interaction(p, N))) + 
  expand_limits(y = 0) + 
  scale_y_continuous(labels = label_percent()) + 
  scale_x_log10() + 
  theme_bw()
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


ggplot(grid, aes(N/p, n_outliers/N, fill = ordered(p))) + 
  geom_smooth(aes(color = ordered(p)), se = FALSE) + 
  geom_boxplot(aes(group = interaction(p, N))) + 
  expand_limits(y = 0) + 
  scale_y_continuous(labels = label_percent()) + 
  scale_x_log10() + 
  theme_bw()
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

^{Created on 2024-02-03 with reprex v2.0.2}

strengejacke commented 8 months ago

So could we print a message for the user when the N/p ratio is inadequate?

mattansb commented 8 months ago

Yes, I think that is a good idea.

rempsyc commented 8 months ago

Great demonstration / finding Mattan!

easystats / performance

Investigating the high % of outliers detected with the MCD method #672