Open mattansb opened 2 years ago
Could be related to #642, where @bwiernik suggested using a different smooth function for "non-continuous" scales.
Yeah
For these plots, we use fitted() for the x-axis, and scaled residuals for the y-axis. But when do we decide whether the x-axis is "categorical"? E.g., adding a continuous variable to the model makes the plot look much more "usual":
set.seed(1)
d <- data.frame(x = rpois(360, 1.7), x2 = rnorm(360))
d$y <- d$x + rnorm(length(d$x))
m <- lm(y ~ x + x2, data = d)
performance::check_heteroscedasticity(m) |> plot()
In the above case, fitted() returns 360 unique values, the same as the number of observations. For Mattan's example, fitted() returns 27 unique values, far fewer than the 360 observations:
set.seed(1)
d <- data.frame(x = rpois(360, 1.7))
d$y <- d$x + rnorm(length(d$x))
m <- lm(y ~ x, data = d)
length(unique(fitted(m)))
#> [1] 27
We must either think of a way to determine the "spread" of data points across the x-axis (even in the first example, they all "spread" around integer values), or require that the number of unique fitted values be at least x% of nobs.
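The x% rule could be operationalized roughly like this (a hypothetical sketch, not current package behaviour; the 0.5 threshold is an arbitrary placeholder):

```r
# Hypothetical heuristic: call the x-axis "discrete" when the share of
# unique fitted values is low relative to the number of observations.
set.seed(1)
d <- data.frame(x = rpois(360, 1.7))
d$y <- d$x + rnorm(nrow(d))
m <- lm(y ~ x, data = d)

unique_ratio <- length(unique(fitted(m))) / nobs(m)
unique_ratio
#> [1] 0.075
is_discrete_x <- unique_ratio < 0.5  # 0.5 is a placeholder threshold
is_discrete_x
#> [1] TRUE
```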
I think that plot is fine, even though it's sort of clustered
I think if it's either of these cases:
And maybe let's have an argument that can be set to force one form or the other?
> I think if it's either of these cases:
> - It's a discrete model like Poisson or Binomial or Negative Binomial or ordinal (though I think we already have a different plot for binomial)
But a discrete model ≠ discrete predictions (fitted values), so is this necessary?
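To illustrate the distinction (a sketch with made-up data): a Poisson model with a continuous predictor still produces as many unique fitted values as observations, even though the outcome is a count:

```r
# Discrete outcome, continuous predictor: fitted values are still continuous,
# since the monotone inverse link maps distinct x to distinct fitted values.
set.seed(2)
d <- data.frame(x = rnorm(200))
d$y <- rpois(200, lambda = exp(0.5 + 0.3 * d$x))
m <- glm(y ~ x, family = poisson, data = d)
length(unique(fitted(m)))
#> [1] 200
```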
Yeah actually thinking about it, the homogeneity of variance plot really only applies to Gaussian models.
Maybe we detect based on the predictors all being factors and/or binary?
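A possible detection rule along those lines (a sketch only; `all_predictors_discrete()` is a made-up helper, not an existing function in performance or see):

```r
# Flag the model as "categorical-only" when every predictor column in the
# model frame is a factor, character, logical, or has at most two values.
all_predictors_discrete <- function(model) {
  mf <- model.frame(model)
  preds <- mf[, -1, drop = FALSE]  # drop the response column
  all(vapply(preds, function(p) {
    is.factor(p) || is.character(p) || is.logical(p) ||
      length(unique(p)) <= 2
  }, logical(1)))
}

m1 <- lm(mpg ~ factor(cyl) + am, data = mtcars)  # factor + binary
m2 <- lm(mpg ~ wt, data = mtcars)                # continuous
all_predictors_discrete(m1)
#> [1] TRUE
all_predictors_discrete(m2)
#> [1] FALSE
```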
Are there currently any arguments that can be supplied to just drop the loess curve or its CIs? Just as a quick fix. Also, forcing the ylims to something reasonable seemed to help in some tests.
A simple ylim() adjustment seems to contain the craziness. Perhaps this is an easier route to a solution?
library(performance)
library(ggplot2)
#> Warning: package 'performance' was built under R version 4.1.3
library(see)
set.seed(1)
x <- rpois(360, 1.7)
y <- x + rnorm(length(x))
m <- lm(y ~ x)
#plot & adjust ylim
plot(check_heteroskedasticity(m)) + ylim(0,2)
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
#> : pseudoinverse used at -0.068209
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
#> : neighborhood radius 2.0697
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
#> : reciprocal condition number 1.7879e-15
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
#> : There are other near singularities as well. 4.1579
#> Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
#> else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
#> -0.068209
#> Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
#> else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
#> 2.0697
#> Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
#> else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
#> number 1.7879e-15
#> Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
#> else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : There are other near
#> singularities as well. 4.1579
#> Warning: Removed 26 rows containing missing values or values outside the scale range
#> (`geom_smooth()`).
Created on 2024-10-23 with reprex v2.1.0
Setting degree = 0 seems to work:
library(performance)
library(see)
set.seed(1)
x <- rpois(360, 1.7)
y <- x + rnorm(length(x))
m <- lm(y ~ x)
(p <- plot(check_heteroskedasticity(m)))
p$layers[[2]]$stat_params$method.args <- list(degree = 0)
# ggplot2::geom_smooth(..., method.args = list(degree = 0))
p
Should we generally set degree = 0 for count/discrete outcomes?
You mean count/discrete-ish predictions?
I think we might want to default to degree = 0 always (that's what I've started doing...)
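For reference, the same degree-0 smooth can be requested directly from ggplot2 when building a plot, without poking into the plot object's layers (a sketch, not the package's internal code):

```r
library(ggplot2)

set.seed(1)
x <- rpois(360, 1.7)
y <- x + rnorm(length(x))

# degree = 0 fits a local constant rather than a local line, which avoids
# the near-singularity warnings when x clusters on a few integer values
p <- ggplot(data.frame(x, y), aes(x, y)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", formula = y ~ x,
              method.args = list(degree = 0))
p
```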
What's this weirdness?
Created on 2022-03-24 by the reprex package (v2.0.1)