Closed tappituffi closed 2 months ago
Dear Niklas, thank you very much for following-up with this issue after your email. I was wondering if you could include some example code for both check_model()
and check_outliers()
so we are better able to investigate what is happening.
If using your own data, you probably cannot share it, however, you could share the code and output that generated it to give us an idea and test with a reproducible example on our end. Even better, would be if you would be able to generate a reprex
using base R data, for example included in the help()
examples of the linear mixed model function you use.
Hi,
I am not familiar with reprex etc but I could generate a reproducible code that has the same issue. When you run below code it will generate the check_model plots and shows only a red dot for 40. However, all other points outside of the contour line are not in red even though they should be outliers? Also, when you run check_outlier, it only states the 40 as outlier and when you run plot(outlier) it doesn't even show the dot for 40 in red. So I guess this code shows the issue I have for my study. So my question is if points are outside the contour line, should they be regarded as outliers and hence colored in red or not? Thanks in advance!
# Install necessary packages if they are not already installed
if (!require("lme4")) install.packages("lme4", dependencies = TRUE)
if (!require("performance")) install.packages("performance", dependencies = TRUE)
# Load the packages
library(lme4)
library(performance)
library(see)
# Create the data frame
set.seed(123) # for reproducibility
n <- 100
subjectID <- paste("Subject", rep(1:10, each = 10))
PN <- runif(n, 0, 100)
alpha <- 0.5 * PN + rnorm(n, mean = 50, sd = 10)
# Introducing outliers
alpha[c(20, 40)] <- c(150, 160) # arbitrarily chosen subjects for outliers
data <- data.frame(subjectID, PN, alpha)
# Fit the mixed model
model <- lmer(alpha ~ PN + (1 | subjectID), data = data)
outlier <- check_outliers(model)
outlier
plot(outlier)
check_model(model)
Thank you for providing reproducing code. For future reference, I have turned it into a reprex using the reprex
package which allows us to also see the visual output of your code:
# Load the packages
library(lme4)
library(performance)
library(see)
# Create the data frame
set.seed(123) # for reproducibility
n <- 100
subjectID <- paste("Subject", rep(1:10, each = 10))
PN <- runif(n, 0, 100)
alpha <- 0.5 * PN + rnorm(n, mean = 50, sd = 10)
# Introducing outliers
alpha[c(20, 40)] <- c(150, 160) # arbitrarily chosen subjects for outliers
data <- data.frame(subjectID, PN, alpha)
# Fit the mixed model
model <- lmer(alpha ~ PN + (1 | subjectID), data = data)
outlier <- check_outliers(model)
outlier
#> 1 outlier detected: case 40.
#> - Based on the following method and threshold: cook (0.7).
#> - For variable: (Whole model).
# plot(outlier)
x <- check_model(model)
plot1 <- plot(x$OUTLIERS)
plot1
plot2 <- plot(x)$OUTLIERS
plot2
Created on 2024-04-28 with reprex v2.1.0
We can see that plot(x$OUTLIERS)
(check_outliers()
) and plot(x)$OUTLIERS
(from check_model()
) provide different results. It seems that this difference comes from the see
packages rather than the performance
package. Specifically, it comes from the function used internally, .plot_diag_outliers_new()
.
# Prepare model
library(see)
library(lme4)
library(performance)
set.seed(123) # for reproducibility
n <- 100
subjectID <- paste("Subject", rep(1:10, each = 10))
PN <- runif(n, 0, 100)
alpha <- 0.5 * PN + rnorm(n, mean = 50, sd = 10)
alpha[c(20, 40)] <- c(150, 160) # arbitrarily chosen subjects for outliers
data <- data.frame(subjectID, PN, alpha)
model <- lmer(alpha ~ PN + (1 | subjectID), data = data)
x <- check_model(model)
# These two functions provide different results:
# `check_model()`
see:::.plot_diag_outliers_new(x$INFLUENTIAL)
# `check_outliers()`
see:::plot.see_check_outliers(x$OUTLIERS)
# Methods don't agree on observation 40
x$INFLUENTIAL[40, ]
#> Hat Cooks_Distance Predicted Residuals Std_Residuals Index
#> 40 0.04070025 0.9471816 65.41945 94.58055 94.58055 40
#> Influential
#> 40 Influential
attributes(x$OUTLIERS)$influential_obs[40, ]
#> Hat Cooks_Distance Predicted Residuals Std_Residuals Index
#> 40 0.04070025 0.9471816 65.41945 94.58055 94.58055 40
#> Influential
#> 40 OK
# Even though it is identified correctly here:
attributes(x$OUTLIERS)$data[40, ]
#> Row Distance_Cook Outlier_Cook Outlier
#> 40 40 0.9471816 1 1
# `check_model()`
attributes(x$INFLUENTIAL)$cook_levels
#> [1] 0.698073
As the cook’s distance is 0.94, which is higher than 0.7, it should be tagged as outlier though, so this appears like an internal bug from check_outliers()
.
Created on 2024-04-28 with reprex v2.1.0
I have implemented a fix in PR #717, thank you for the report
Hi, -> When I look at the outlier plot, there are some samples that fall outside the contour line but are not flagged in red, as demonstrated in the performance tutorial. Also, when I check the model with the check_outliers function, it says I don't have any outliers. So my question is why those samples are not flagged as outliers even though they are outside the contour lines. Has this something to do with that I am using mixed models or how could I explain this?
Thanks in advance and best, Niklas