easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0

Weird edge case for `check_heteroskedasticity` plots #408

Open mattansb opened 2 years ago

mattansb commented 2 years ago

What's this weirdness?

library(performance)
#> Warning: package 'performance' was built under R version 4.1.3
library(see)

set.seed(1)
x <- rpois(360, 1.7)
y <- x + rnorm(length(x))

m <- lm(y ~ x)

plot(check_heteroskedasticity(m))
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
#> parametric, : pseudoinverse used at -0.068209
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
#> parametric, : neighborhood radius 2.0697
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
#> parametric, : reciprocal condition number 1.9085e-015
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
#> parametric, : There are other near singularities as well. 4.1579
#> Warning in predLoess(object$y, object$x, newx = if
#> (is.null(newdata)) object$x else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
#> -0.068209
#> Warning in predLoess(object$y, object$x, newx = if
#> (is.null(newdata)) object$x else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
#> 2.0697
#> Warning in predLoess(object$y, object$x, newx = if
#> (is.null(newdata)) object$x else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
#> number 1.9085e-015
#> Warning in predLoess(object$y, object$x, newx = if
#> (is.null(newdata)) object$x else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : There are other near
#> singularities as well. 4.1579

plot(x, y)

Created on 2022-03-24 by the reprex package (v2.0.1)

strengejacke commented 4 months ago

Could be related to #642, where @bwiernik suggested using a different smooth function for "non-continuous" scales.

bwiernik commented 4 months ago

Yeah

strengejacke commented 4 months ago

For these plots, we use fitted() for the x-axis and scaled residuals for the y-axis. But when do we decide that the x-axis is "categorical"? E.g., adding a continuous variable to the model makes the plot look much more "usual":

set.seed(1)
d <- data.frame(x = rpois(360, 1.7), x2 = rnorm(360))
d$y <- d$x + rnorm(length(d$x))

m <- lm(y ~ x + x2, data = d)
performance::check_heteroscedasticity(m) |> plot()

fitted() in the above case returns 360 unique values, the same as the number of observations.

For Mattan's example, fitted() returns 27 unique values, much less than the 360 observations:

set.seed(1)
d <- data.frame(x = rpois(360, 1.7))
d$y <- d$x + rnorm(length(d$x))

m <- lm(y ~ x, data = d)
length(unique(fitted(m)))
#> [1] 27

We must either find a way to determine the "spread" of data points across the x-axis (even in the first example, they all "spread" around integer values), or decide whether we want to require at least x% of unique fitted values relative to nobs.
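The unique-fitted-values heuristic described above could be sketched like this (the function name and the 10% threshold are hypothetical, not part of the performance API):

```r
# Hypothetical heuristic (not the performance implementation): treat the
# x-axis as "discrete" when the share of unique fitted values is small
# relative to the number of observations.
is_discrete_fitted <- function(model, threshold = 0.1) {
  f <- stats::fitted(model)
  length(unique(f)) / length(f) < threshold
}

set.seed(1)
d <- data.frame(x = rpois(360, 1.7))
d$y <- d$x + rnorm(nrow(d))
m <- lm(y ~ x, data = d)
is_discrete_fitted(m)  # few unique fitted values relative to 360, so TRUE
```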

bwiernik commented 4 months ago

I think that plot is fine, even though it's sort of clustered

I think if it's either of these cases:

  1. It's a discrete model like Poisson or Binomial or Negative Binomial or ordinal (though I think we already have a different plot for binomial)
  2. The number of discrete fitted values is "small", maybe 10 or fewer?

And maybe let's have an argument that can be set to force one form or the other?
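Both proposed checks combined could look roughly like this (the function name and the max_unique default are placeholders, not an actual performance API):

```r
# Placeholder sketch, not the performance implementation: use a "discrete"
# plot when the model family is discrete (Poisson, binomial, negative
# binomial) or when the fitted values take only a few unique values.
use_discrete_plot <- function(model, max_unique = 10) {
  fam <- tryCatch(stats::family(model)$family,
                  error = function(e) "gaussian")
  # "Negative Binomial(theta)" also matches "binomial"
  discrete_family <- grepl("poisson|binomial", fam, ignore.case = TRUE)
  few_fitted <- length(unique(stats::fitted(model))) <= max_unique
  discrete_family || few_fitted
}

set.seed(1)
m1 <- glm(y ~ x, family = poisson,
          data = data.frame(x = rpois(50, 2), y = rpois(50, 2)))
m2 <- lm(dist ~ speed, data = cars)  # gaussian, 19 unique fitted values
use_discrete_plot(m1)  # TRUE
use_discrete_plot(m2)  # FALSE
```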

mattansb commented 4 months ago

> I think if it's either of these cases:
>
>   1. It's a discrete model like Poisson or Binomial or Negative Binomial or ordinal (though I think we already have a different plot for binomial)

But a discrete model ≠ discrete predictions (fitted values), so is this necessary?

bwiernik commented 4 months ago

Yeah actually thinking about it, the homogeneity of variance plot really only applies to Gaussian models.

Maybe we detect based on the predictors all being factors and/or binary?
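That detection could be sketched in base R like this (a hypothetical helper, not the performance implementation; inside easystats, insight's predictor utilities would be the more natural route):

```r
# Sketch: TRUE when every predictor in the model frame is a factor,
# character, logical, or takes at most two distinct values.
all_predictors_discrete <- function(model) {
  mf <- stats::model.frame(model)
  preds <- mf[, -1, drop = FALSE]  # first column is the response
  all(vapply(preds, function(v) {
    is.factor(v) || is.character(v) || is.logical(v) ||
      length(unique(v)) <= 2
  }, logical(1)))
}

m1 <- lm(mpg ~ factor(am) + factor(cyl), data = mtcars)
m2 <- lm(mpg ~ wt, data = mtcars)
all_predictors_discrete(m1)  # TRUE
all_predictors_discrete(m2)  # FALSE
```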

jbohenek commented 1 month ago

Are there currently any arguments that can be supplied to drop the loess curve or its CIs, just as a quick fix? Forcing the y-limits to something reasonable also seemed to help in some tests.

jbohenek commented 1 month ago

A simple ylim() adjustment seems to contain the craziness. Perhaps this is an easier route to a solution?

library(performance)
#> Warning: package 'performance' was built under R version 4.1.3
library(ggplot2)
library(see)

set.seed(1)
x <- rpois(360, 1.7)
y <- x + rnorm(length(x))
m <- lm(y ~ x)

# plot and adjust ylim
plot(check_heteroskedasticity(m)) + ylim(0, 2)

#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
#> : pseudoinverse used at -0.068209
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
#> : neighborhood radius 2.0697
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
#> : reciprocal condition number 1.7879e-15
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
#> : There are other near singularities as well. 4.1579
#> Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
#> else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
#> -0.068209
#> Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
#> else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
#> 2.0697
#> Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
#> else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
#> number 1.7879e-15
#> Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
#> else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : There are other near
#> singularities as well. 4.1579
#> Warning: Removed 26 rows containing missing values or values outside the scale range
#> (`geom_smooth()`).

(plot: heteroskedasticity check with ylim(0, 2) applied)

Created on 2024-10-23 with reprex v2.1.0

mattansb commented 1 month ago

Setting degree = 0 seems to work:

library(performance)
library(see)

set.seed(1)
x <- rpois(360, 1.7)
y <- x + rnorm(length(x))

m <- lm(y ~ x)

(p <- plot(check_heteroskedasticity(m)))


p$layers[[2]]$stat_params$method.args <- list(degree = 0)
# ggplot2::geom_smooth(..., method.args = list(degree = 0))

p

strengejacke commented 1 month ago

Should we generally set degree = 0 for count/discrete outcomes?

mattansb commented 1 month ago

You mean count/discrete-ish predictions?

I think we might want to default to degree = 0 always (that's what I've started doing...)
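For context on why degree = 0 helps: it makes loess fit a locally weighted constant (a running mean) instead of local linear or quadratic polynomials, so the fit stays well-conditioned even when the x values pile up on a few discrete points. A standalone sketch of building such a plot directly with ggplot2 (the sqrt-abs-standardized-residuals y-axis here is a scale-location-style stand-in, not necessarily what see computes):

```r
library(ggplot2)

set.seed(1)
x <- rpois(360, 1.7)
y <- x + rnorm(length(x))
m <- lm(y ~ x)

d <- data.frame(fitted = fitted(m),
                resid = sqrt(abs(rstandard(m))))

# degree = 0: locally weighted constant fit, robust to clustered x values,
# avoiding the pseudoinverse/near-singularity warnings from simpleLoess().
ggplot(d, aes(fitted, resid)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x,
              method.args = list(degree = 0))
```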