Bias in sjSDM::linear() coefficient estimates when including an intercept term.

mintzj commented 4 months ago

Hello, thank you for your fine work on this package. I am looking forward to using it in my research but there is an issue I need to address first. I have tried to provide some working examples to illustrate the issue.

Models built using sjSDM::linear() appear to create biased coefficient estimates when a fixed effect intercept (environmental) is included. Below I demonstrate the discrepancy using an example dataset, and simulate several test datasets to explore this behavior.

Example 1: datasets::cars

set.seed(40)

# Straight ahead LM: braking distance
data(cars)
head(cars)
cars_lm <- lm(formula = "dist ~ speed", data = cars)
plot(cars, xlim = c(0, 25))
abline(h = 0, v = 0, col = "#33333311", lwd = 2)
abline(cars_lm)
summary(cars_lm)

# sjsdm for LM

distance <- as.matrix(cars$dist, drop = F)
env_linear <- linear(data = cars, formula = ~ speed)

cars_model <- sjSDM(
  Y = distance, 
  env = env_linear, 
  se = TRUE, 
  family = gaussian(link = "identity"), sampling = 100L)

summary(cars_model) # sjSDM:  b0 = 1.7, b1 = 2.8
summary(cars_lm)    # base:   b0 = -17, b1 = 3.9
abline(coef(cars_model)[[1]], col = "#33553399", lwd = 3, lty = "dashed")
legend(x = 0, y = 100, legend = c("LM", "sjSDM-linear"), 
       fill = c("black", "#33553399"))

Example

Scenario 1: Simulating a similar scenario, but with negative data intercept


set.seed(40)
beta0 <- -100
beta1 <- -4
sig <- 15
nobs <- 50

xval = runif(n = nobs, min = 0, max = 25)
mu = xval * beta1 + beta0
yval = rnorm(n = nobs, mean = mu, sd = sig)

slr_dat <- data.frame(xval, yval)
slr <- lm(yval ~ xval)

slr_linear <- linear(data = slr_dat, formula = ~ xval)
slr_model <- sjSDM(
  Y = matrix(yval, ncol = 1), 
  env = slr_linear, 
  se = TRUE, 
  family = gaussian(link = "identity"), sampling = 100L)

summary(slr)
summary(slr_model)

plot(slr_dat, xlim = c(0, 26), ylim = c(-250, 100))
abline(h = 0, v = 0, col = "#33333311", lwd = 2)
abline(slr)
abline(coef(slr_model)[[1]], col = "#33553399", lwd = 3, lty = "dashed")
legend(x = 19, y = 90, legend = c("LM", "sjSDM-linear"), 
       fill = c("black", "#33553399"))

Scenario 1

Scenario 2: No model intercept, no data intercept

Estimated slope appears to be unbiased when data are generated without intercept and model is fit without intercept.


set.seed(40)
beta0 <- 0
beta1 <- -4
sig <- 15
nobs <- 50

xval = runif(n = nobs, min = 0, max = 25)
mu = xval * beta1 + beta0
yval = rnorm(n = nobs, mean = mu, sd = sig)

slr_dat <- data.frame(xval, yval)
slr <- lm(yval ~ xval - 1)

slr_linear <- linear(data = slr_dat, formula = ~ xval + 0)
slr_model <- sjSDM(
  Y = matrix(yval, ncol = 1), 
  env = slr_linear, 
  se = TRUE, 
  family = gaussian(link = "identity"), sampling = 100L)

summary(slr)
summary(slr_model)
# Similar point estimates but variances differ, 
#  which is fine because they arise from different models.

plot(slr_dat, xlim = c(0, 26), ylim = c(-150, 100))
abline(h = 0, v = 0, col = "#33333311", lwd = 1.5)
abline(slr, lwd = 2)
abline(a = 0, b = coef(slr_model)[[1]], col = "#33553399", lwd = 7, lty = "dashed")
legend(x = 19, y = 90, legend = c("LM", "sjSDM-linear"), 
       fill = c("black", "#33553399"))

Scenario 2

Scenario 3: positive intercept, minimal slope


set.seed(40)
beta0 <- 50
beta1 <- 0
sig <- 15
nobs <- 50

xval = runif(n = nobs, min = 0, max = 25)
mu = xval * beta1 + beta0
yval = rnorm(n = nobs, mean = mu, sd = sig)

slr_dat <- data.frame(xval, yval)
slr <- lm(yval ~ xval)

slr_linear <- linear(data = slr_dat, formula = ~ xval)
slr_model <- sjSDM(
  Y = matrix(yval, ncol = 1), 
  env = slr_linear, 
  se = TRUE, 
  family = gaussian(link = "identity"), sampling = 100L)

summary(slr)
summary(slr_model)

plot(slr_dat, xlim = c(0, 26), ylim = c(0, 150))
abline(h = 0, v = 0, col = "#33333311", lwd = 2)
abline(slr)
abline(coef(slr_model)[[1]], col = "#33553399", lwd = 3, lty = "dashed")
legend(x = 0, y = 150, legend = c("LM", "sjSDM-linear"), 
       fill = c("black", "#33553399"))

Scenario 3

Scenario 4: positive intercept, negative slope


set.seed(40)
beta0 <- 100
beta1 <- -4
sig <- 15
nobs <- 50

xval = runif(n = nobs, min = 0, max = 25)
mu = xval * beta1 + beta0
yval = rnorm(n = nobs, mean = mu, sd = sig)

slr_dat <- data.frame(xval, yval)
slr <- lm(yval ~ xval)

slr_linear <- linear(data = slr_dat, formula = ~ xval)
slr_model <- sjSDM(
  Y = matrix(yval, ncol = 1), 
  env = slr_linear, 
  se = TRUE, 
  family = gaussian(link = "identity"), sampling = 100L)

summary(slr)
summary(slr_model)

plot(slr_dat, xlim = c(0, 26), ylim = c(0, 150))
abline(h = 0, v = 0, col = "#33333311", lwd = 2)
abline(slr)
abline(coef(slr_model)[[1]], col = "#33553399", lwd = 3, lty = "dashed")
legend(x = 0, y = 150, legend = c("LM", "sjSDM-linear"), 
       fill = c("black", "#33553399"))

Scenario4

Comments

My initial guess was that the intercept is in some way was not being given the correct weighting or is being shrunk toward zero through regularization, however I was not able to affect fitted intercept by changing the amount regularization. Are there any other settings we could change during modeling to mitigate this issue? If the intercept is incorrectly estimated, the true intercept's contribution to the model will become confounded with the other fixed effect coefficients, biasing their estimates as well.

MaximilianPi commented 4 months ago

Hi @mintzj It seems that the problem is mainly in optimization, the optimizer needs more iterations and a higher learning rate to get such large (absolute) estimates:

slr_linear <- linear(data = slr_dat, formula = ~ xval)
slr_model <- sjSDM(
  Y = matrix(yval, ncol = 1), 
  env = slr_linear, 
  se = TRUE, 
  family = gaussian(link = "identity"), sampling = 100L, iter = 500L, learning_rate = 0.7)

> summary(slr_model)
Family:  gaussian 

LogLik:  -209.5333 
Regularization loss:  0 

Species-species correlation matrix: 

    sp1 1.0000

                Estimate Std.Err Z value Pr(>|z|)    
sp1 (Intercept)  -89.204   4.707   -18.9   <2e-16 ***
sp1 xval          -4.615   0.342   -13.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Although it is still biased, as you have already assumed, there is always a very weak regularization (l2) on all weights, which has a stronger effect on such large effects. Without:

slr_linear <- linear(data = slr_dat, formula = ~ xval)
slr_model <- sjSDM(
  Y = matrix(yval, ncol = 1), 
  env = slr_linear, 
  se = TRUE, 
  family = gaussian(link = "identity"), sampling = 100L, iter = 500L, learning_rate = 0.7, control = sjSDMControl(RMSprop(weight_decay = 0.0)))

> summary(slr_model)
Family:  gaussian 

LogLik:  -207.0031 
Regularization loss:  0 

Species-species correlation matrix: 

    sp1 1.0000

                Estimate Std.Err Z value Pr(>|z|)    
sp1 (Intercept)  -98.836   4.576   -21.6   <2e-16 ***
sp1 xval          -3.975   0.332   -12.0   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Probably I was too optimistic about the number of iterations, but usually we do not have such (absolute) large intercepts with binomial data.

mintzj commented 4 months ago

Is it possible to turn off regularization of the fixed effects in the environmental component, without turning off the regularization for the entire model? It could be useful to allow the fixed effects components to behave as in ordinary regression, but apply regularization to the random effects, since the number of model parameters grows faster with respect to number of species than number of sites or covariates.

MaximilianPi commented 4 months ago

Hi @mintzj,

Yes:

slr_model <- sjSDM(
  Y = matrix(yval, ncol = 1), 
  env = slr_linear, 
  biotic = bioticStruct(lambda= 0.001, alpha = 1.0), # alpha weighting between L1 and L2 
  se = TRUE, 
  family = gaussian(link = "identity"), sampling = 100L, iter = 500L, learning_rate = 0.7, control = sjSDMControl(RMSprop(weight_decay = 0.0)))

TheoreticalEcology / s-jSDM