easystats / see

:art: Visualisation toolbox for beautiful and publication-ready figures
https://easystats.github.io/see/
Other
874 stars 43 forks source link

Plotting normality of residuals: Scaling issues / differences #335

Closed mutlusun closed 2 months ago

mutlusun commented 5 months ago

Dear package authors,

Thank you very much for all your work and effort for the easystats packages. I like and use them a lot and they make my life definitely easier.

Recently, I stumbled upon a scaling "issue" with a data set consisting of two time points. Here is a small example for examination:

set.seed(1234)
x <- rnorm(100)
y <- rnorm(100, mean = x, sd = .5)
myd <- data.frame(x, y)

When fitting a regression model to this data and plotting the residuals, I get the impression that something is off here:

fit <- lm(y ~ x, data = myd)
tmp <- performance::check_normality(fit)
plot(tmp, type = "qq")

see-wo-qqplotr-001

However, plotting in a different way doesn't show such a strong deviation:

plot(tmp, type = "qq", detrend = FALSE)

see-wo-qqplotr-002

First of all, I'm not sure whether these differences in visualization are an "issue" of this package. When looking at the y-axis in the first plot, it is clear that the deviations don't have a strong magnitude. However, I still find it a bit strange that the same function suggests different interpretations.

Would it be an option to set certain minimal limits for the plot in the case of detrend = TRUE? I'm happy to implement this but wanted to ask for your thoughts before.

Note: This is not such an issue with qqplotr installed as the bands are often very wide and the interval wider.

bwiernik commented 5 months ago

The second plot shows a very strong deviation. Those tails are much wider than expected under a normal distribution. The first plot better shows the amount of the deviation by removing the angular axis.

strengejacke commented 5 months ago

This issue is related to https://github.com/easystats/performance/pull/643#issuecomment-1999548863. When you look at the examples (https://easystats.github.io/see/articles/performance.html#qq-plot), you see that detrend = TRUE often works well. But sometimes, due to the different limits of the y-axis, the plots look rather different (although they are not).

I still find it a bit strange that the same function suggests different interpretations.

It does not. It's just the scale that "zooms in" for detrend = TRUE, and actually, for detrend = TRUE, the deviation is usually easier to see.

That said, it may be surprising to users to see that large differences. Maybe we find a way to adjust the y-axis, so the differences are not that large.

bwiernik commented 5 months ago

We set the non-detrended y axis limits to fixed values if I recall correctly. We should do the same with appropriate values for the detrended y axis.

strengejacke commented 5 months ago

No, I think for both plots, we don't set limits, thus ggplot2 decides which range to use, based on the range of the data.

https://github.com/easystats/see/blob/1a090e10a3943588b78889dfb0ca365928992bfe/R/plot.check_normality.R#L214

The question is if it's possible to find reasonable y-limits for the detrended plot?

mutlusun commented 5 months ago

Thank you for your replies!

Sorry, for being sloppy in my description above. As both of you pointed out, the tails show a strong deviation and both plots show the same thing but the scaling is different. I am aware of that. I was, for example, more concerned with the range of -1 to 1 of the standard normal distribution quantiles which could be interpreted differently between both plots (when not paying attention to the limits of the y-axes).

The question is if it's possible to find reasonable y-limits for the detrended plot?

I have thought about that a while but I think it is hard to come up with reasonable y-limits. Is there a good rule of thumb for deciding when there is a deviation? (This way minimum y-limits could be set. But, I didn't find one and it is probably misleading to decide for one.)

As an alternative: Would it be possible to set qqplotr as a hard dependency for this plot? This could be an easy way to increase the y-limits a bit.

ClaudioZandonella commented 3 months ago

Adding on this topic, I found these results interesting.

qqplot with detrend = TRUE

Screenshot 2024-05-22 at 11 04 36

qqplot with detrend = FALSE

Screenshot 2024-05-22 at 11 04 23

They provide completely opposite results 🤯. This makes interpretation pretty troublesome. Running shapiro.test() on the residuals I obtain p-value = 0.8866.

Do you have any idea on how to interpret these opposite results?

This is not just a toy example, I initially obtained these results from a real study. I am not sure if these strange results are related to the characteristics of the study: small fixed effect (Marginal R2 ~ 12%), but very large random effect (conditional R2 ~ 90%), so the remaining residual variance is really small.

Below is a minimal working example to simulate data and reproduce the plots.

Click me - Snippet Code ``` library('tidyverse') library('ggplot2') library('lme4') set.seed(2024) # create data df <- expand_grid( id = 1:20, trial = 1:20, condition = 0:1 ) %>% mutate( across(c(id, trial, condition), factor) ) contrasts(df$condition) <- contr.sum(2) # generate random effect ran_ef <- rnorm(20, mean = 0, sd =.45) # get model matrix random effects my_model <- "y ~ condition + (1 | id)" df$y <- 1 # lFormula need y foo <- lFormula(eval(my_model), df) Z <- t(as.matrix(foo$reTrms$Zt)) # get model matrix fixed effects V <- model.matrix(~ condition, data = df) # simulate y df$y <- as.vector(V %*% c(1.4, .15) + Z %*% ran_ef + rnorm(nrow(df), mean = 0, sd = .15)) # plot data ggplot(df) + geom_point(aes(x = id, y = y, color = condition), alpha = .6, position = position_jitter(width = .2)) # fit model my_fit <- lmer(y ~ condition + (1 | id), data = df) performance::check_model(my_fit) performance::r2(my_fit) performance::check_normality(my_fit) shapiro.test(residuals(my_fit)) qqnorm(residuals(my_fit)) tmp <- performance::check_normality(my_fit) plot(tmp, type = "qq") plot(tmp, type = "qq", detrend = FALSE) ```

By the way, thank you for the amazing package❤️❤️

strengejacke commented 3 months ago

It seems like qqplotr is not installed? I get completely different plots:

library(easystats)
#> # Attaching packages: easystats 0.7.1.2
#> ✔ bayestestR  0.13.2.1    ✔ correlation 0.8.4.2  
#> ✔ datawizard  0.10.0.4    ✔ effectsize  0.8.8    
#> ✔ insight     0.19.11.2   ✔ modelbased  0.8.7    
#> ✔ performance 0.11.0.9    ✔ parameters  0.21.7.1 
#> ✔ report      0.5.8.2     ✔ see         0.8.4.1
library(lme4)
#> Loading required package: Matrix
set.seed(2024)

# create data
df <- expand.grid(
  id = 1:20,
  trial = 1:20,
  condition = 0:1
) |>
  to_factor()

contrasts(df$condition) <- contr.sum(2)

# generate random effect
ran_ef <- rnorm(20, mean = 0, sd = .45)

# get model matrix random effects
my_model <- "y ~ condition + (1 | id)"
df$y <- 1 # lFormula need y
foo <- lFormula(eval(my_model), df)
Z <- t(as.matrix(foo$reTrms$Zt))

# get model matrix fixed effects
V <- model.matrix(~condition, data = df)

# simulate y
df$y <- as.vector(V %*% c(1.4, .15) + Z %*% ran_ef + rnorm(nrow(df), mean = 0, sd = .15))

# fit model
my_fit <- lmer(y ~ condition + (1 | id), data = df)
tmp <- performance::check_normality(my_fit)
plot(tmp, type = "qq")

plot(tmp, type = "qq", detrend = FALSE)

Created on 2024-05-22 with reprex v2.1.0

ClaudioZandonella commented 3 months ago

Yes, thank you for the clarification!

After installing qqplotr I obtain more reasonable (normal😅) results

Before I obtained the message

For confidence bands, please install 'qqplotr'.

I thought it was only an "additional" feature, not a requirement.

Thank you for the immediate help!

strengejacke commented 3 months ago

I thought it was only an "additional" feature, not a requirement.

Yes, usually, it should indeed only be for CIs. But qqplotr has a detrend-argument that does the transformation of the plot, and w/o that package, we have written our own detrend-functions, which obviously doesn't work well in all situations.

@mattansb any ideas?

mattansb commented 3 months ago

The CIs are an additional feature enabled by qqplotr, but they really do make possible the interpretation of the the detrended QQ plot (and the non-detrended plot as well!)

strengejacke commented 3 months ago

I meant the own detrend-code:

https://github.com/easystats/see/blob/7dc72a6e9fed4b58adedf0a68d9f384ba7c80bba/R/plot.check_normality.R#L281-L304

Any ideas, why the above plot (https://github.com/easystats/see/issues/335#issuecomment-2124325706) deviates that much when qqplotr is not installed?

mattansb commented 3 months ago

Hmmmmm.... there's some scaling issue here. I'll look into it.

mattansb commented 3 months ago

This isn't 100%, but it's very close.........

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.3.3

g <- rnorm(20, sd = 15)

old

ggplot() + 
  stat_qq(aes(sample = g,
              y = after_stat(sample - theoretical),
              x = after_stat(theoretical)))

fix?

# Re scale?
N <- length(g)
SD <- sd(g) * sqrt((N - 1) / N)

ggplot() + 
  stat_qq(aes(sample = g,
              y = after_stat(sample - theoretical * SD),
              x = after_stat(theoretical * SD)))

similiar to qqplotr

ggplot() + 
  qqplotr::stat_qq_point(aes(sample = g), detrend = TRUE)

Created on 2024-05-22 with reprex v2.1.0

strengejacke commented 3 months ago

Thanks, I'll revise the code in see.

strengejacke commented 2 months ago

Scaling in combination with preserving the original axis limits seems to be fine.

Without qqplotr

set.seed(1234)
x <- rnorm(100)
y <- rnorm(100, mean = x, sd = .5)
myd <- data.frame(x, y)
fit <- lm(y ~ x, data = myd)
tmp <- performance::check_normality(fit)
plot(tmp, type = "qq")
#> For confidence bands, please install `qqplotr`.

plot(tmp, type = "qq", detrend = FALSE)
#> For confidence bands, please install `qqplotr`.

With qqplotr

set.seed(1234)
x <- rnorm(100)
y <- rnorm(100, mean = x, sd = .5)
myd <- data.frame(x, y)
fit <- lm(y ~ x, data = myd)
tmp <- performance::check_normality(fit)
plot(tmp, type = "qq")

plot(tmp, type = "qq", detrend = FALSE)

mutlusun commented 2 months ago

Thank you for taking care and fixing!