easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0

Outlier detection in Linear mixed models failed? #711

Closed · tappituffi closed 2 months ago

tappituffi commented 2 months ago
[Screenshot: check_model() outlier plot, 2024-04-14]

Hi, when I look at the outlier plot, some samples fall outside the contour line but are not flagged in red, as demonstrated in the performance tutorial. Also, when I check the model with the check_outliers() function, it says I don't have any outliers. So my question is: why are those samples not flagged as outliers even though they lie outside the contour lines? Does this have something to do with the fact that I am using mixed models, or how else could I explain it?

Thanks in advance and best, Niklas

rempsyc commented 2 months ago

Dear Niklas, thank you very much for following up on this issue after your email. Could you include some example code for both check_model() and check_outliers(), so we are better able to investigate what is happening?

If you are using your own data, you probably cannot share it; however, you could share the code and output that generated the plot to give us an idea, so we can test with a reproducible example on our end. Even better would be if you could generate a reprex using base R data, for example the data included in the help() examples of the linear mixed model function you use.
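
For illustration only (this is not based on your data), a minimal reprex of that kind could use the sleepstudy data that ships with lme4:

# Purely illustrative sketch using lme4's built-in sleepstudy data
library(lme4)
library(performance)
library(see)

model <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
check_outliers(model)
check_model(model)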

tappituffi commented 2 months ago

Hi,

I am not familiar with reprex, but I can provide reproducible code that shows the same issue. When you run the code below, it generates the check_model() plots and shows a red dot only for observation 40. However, all the other points outside the contour line are not in red, even though they should be outliers? Also, when you run check_outliers(), it reports only 40 as an outlier, and when you run plot(outlier) it doesn't even show the dot for 40 in red. So I think this code reproduces the issue I have in my study. My question is: if points are outside the contour line, should they be regarded as outliers and hence colored in red, or not? Thanks in advance!

# Install necessary packages if they are not already installed
if (!require("lme4")) install.packages("lme4", dependencies = TRUE)
if (!require("performance")) install.packages("performance", dependencies = TRUE)
if (!require("see")) install.packages("see", dependencies = TRUE)

# Load the packages
library(lme4)
library(performance)
library(see)

# Create the data frame
set.seed(123)  # for reproducibility
n <- 100
subjectID <- paste("Subject", rep(1:10, each = 10))
PN <- runif(n, 0, 100)
alpha <- 0.5 * PN + rnorm(n, mean = 50, sd = 10)

# Introducing outliers
alpha[c(20, 40)] <- c(150, 160)  # arbitrarily chosen subjects for outliers

data <- data.frame(subjectID, PN, alpha)

# Fit the mixed model
model <- lmer(alpha ~ PN + (1 | subjectID), data = data)
outlier <- check_outliers(model)
outlier
plot(outlier)
check_model(model)
rempsyc commented 2 months ago

Thank you for providing reproducible code. For future reference, I have turned it into a reprex using the reprex package, which also lets us see the visual output of your code:

# Load the packages
library(lme4)
library(performance)
library(see)

# Create the data frame
set.seed(123)  # for reproducibility
n <- 100
subjectID <- paste("Subject", rep(1:10, each = 10))
PN <- runif(n, 0, 100)
alpha <- 0.5 * PN + rnorm(n, mean = 50, sd = 10)

# Introducing outliers
alpha[c(20, 40)] <- c(150, 160)  # arbitrarily chosen subjects for outliers

data <- data.frame(subjectID, PN, alpha)

# Fit the mixed model
model <- lmer(alpha ~ PN + (1 | subjectID), data = data)
outlier <- check_outliers(model)
outlier
#> 1 outlier detected: case 40.
#> - Based on the following method and threshold: cook (0.7).
#> - For variable: (Whole model).
# plot(outlier)

x <- check_model(model)

plot1 <- plot(x$OUTLIERS)
plot1


plot2 <- plot(x)$OUTLIERS
plot2

Created on 2024-04-28 with reprex v2.1.0

rempsyc commented 2 months ago

We can see that plot(x$OUTLIERS) (check_outliers()) and plot(x)$OUTLIERS (from check_model()) provide different results. It seems that this difference comes from the see package rather than the performance package. Specifically, it comes from the internally used function .plot_diag_outliers_new().

# Prepare model
library(see)
library(lme4)
library(performance)
set.seed(123)  # for reproducibility
n <- 100
subjectID <- paste("Subject", rep(1:10, each = 10))
PN <- runif(n, 0, 100)
alpha <- 0.5 * PN + rnorm(n, mean = 50, sd = 10)
alpha[c(20, 40)] <- c(150, 160)  # arbitrarily chosen subjects for outliers
data <- data.frame(subjectID, PN, alpha)
model <- lmer(alpha ~ PN + (1 | subjectID), data = data)
x <- check_model(model)

# These two functions provide different results:
# `check_model()`
see:::.plot_diag_outliers_new(x$INFLUENTIAL)


# `check_outliers()`
see:::plot.see_check_outliers(x$OUTLIERS)


# Methods don't agree on observation 40
x$INFLUENTIAL[40, ]
#>           Hat Cooks_Distance Predicted Residuals Std_Residuals Index
#> 40 0.04070025      0.9471816  65.41945  94.58055      94.58055    40
#>    Influential
#> 40 Influential

attributes(x$OUTLIERS)$influential_obs[40, ]
#>           Hat Cooks_Distance Predicted Residuals Std_Residuals Index
#> 40 0.04070025      0.9471816  65.41945  94.58055      94.58055    40
#>    Influential
#> 40          OK

# Even though it is identified correctly here:
attributes(x$OUTLIERS)$data[40, ]
#>    Row Distance_Cook Outlier_Cook Outlier
#> 40  40     0.9471816            1       1

# `check_model()`
attributes(x$INFLUENTIAL)$cook_levels
#> [1] 0.698073

Since the Cook's distance for observation 40 is 0.94, which is higher than the 0.7 threshold, it should be tagged as an outlier, so this looks like an internal bug in check_outliers().
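
One way to double-check this outside of these plotting functions (a sketch, assuming `model` from the reprex above is still in the workspace and that your lme4 version provides a cooks.distance() method for merMod objects) is to compute the Cook's distances directly:

# Sketch: inspect Cook's distances directly for the fitted mixed model
cd <- cooks.distance(model)  # merMod method from lme4
cd[40]                       # compare with the 0.947 reported by check_model() above
which(cd > 0.7)              # observations exceeding the 0.7 cut-off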

Created on 2024-04-28 with reprex v2.1.0

rempsyc commented 2 months ago

I have implemented a fix in PR #717. Thank you for the report!
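
Until that fix reaches CRAN, a possible workaround (assuming the patch is merged into the development branch) is to install the development version of performance from GitHub:

# Hypothetical workaround: install the development version once the fix is merged
# install.packages("remotes")
remotes::install_github("easystats/performance")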