[Critical] Bug in anatLm with `scale`d variables

cfhammill commented 3 years ago

If a data frame includes a column that is scaled variable with the concomitant attributes it fails dramatically, producing nonsensical t-statistics.

NOTE: This example does not demonstrate the bug, but the bug is real, scroll down in the thread for a working example

> x <- rnorm(10)
> y <- rnorm(10)
> data <- data.frame(x = scale(rnorm(10)))
> 
> summary(lm(y ~ x))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1389 -0.3247 -0.0495  0.1115  1.5012 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03958    0.25557   0.155    0.881
x            0.13819    0.35190   0.393    0.705

Residual standard error: 0.7629 on 8 degrees of freedom
Multiple R-squared:  0.01891,   Adjusted R-squared:  -0.1037 
F-statistic: 0.1542 on 1 and 8 DF,  p-value: 0.7048

> 
> anatLm(~ x, data = data, anat = cbind(brain = y, behav = rnorm(10)))
N: 10 P: 2
Beginning vertex loop: 2 3
Done with vertex loop
anatModel, showing  2  of  2  rows, and  6  of  7 columns
print more by supplying n and width to the print function
      F-statistic   R-squared beta-(Intercept)      beta-x tvalue-(Intercept)   tvalue-x
brain  0.27313739 0.033014971        0.0727127 -0.13194255          0.3035951 -0.5226255
behav  0.07333479 0.009083581        0.5887849  0.08014675          2.0970295  0.2708040

note the difference between Estimate x from lm and brain beta-x from anatLm same for the t-stats

It is likely this bug affects vertexLm and mincLm as well

jasonlerch commented 3 years ago

I can't replicate this - I get RMINC's anatLm and summary(lm) matching, both with and without scaling.

> set.seed(1234)
> y2 <- rnorm(100, mean = 20)
> dim(y2) <- c(10,10)
> x2 <- rnorm(100)
> dim(x2) <- c(10,10)
> x2 <- data.frame(x2)
> 
> summary(lm(y2[,1] ~ x2$X1))

Call:
lm(formula = y2[, 1] ~ x2$X1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9606 -0.4275 -0.1747  0.7776  1.4663 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 19.61773    0.35354  55.489 1.23e-11 ***
x2$X1        0.00563    0.73330   0.008    0.994    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.056 on 8 degrees of freedom
Multiple R-squared:  7.368e-06, Adjusted R-squared:  -0.125 
F-statistic: 5.895e-05 on 1 and 8 DF,  p-value: 0.9941

> 
> anatLm(~X1, x2, y2)
N: 10 P: 2
Beginning vertex loop: 10 3
Done with vertex loop
anatModel, showing  6  of  10  rows, and  6  of  7 columns
print more by supplying n and width to the print function
      F-statistic    R-squared beta-(Intercept)      beta-X1 tvalue-(Intercept)
[1,] 5.894804e-05 7.368451e-06         19.61773  0.005630069           55.48900
[2,] 3.342004e-01 4.009988e-02         19.81145 -0.445176521           53.36109
[3,] 8.619882e-01 9.726803e-02         19.54366 -0.432632409           86.99105
[4,] 5.567750e+00 4.103665e-01         19.42244  1.193168580           79.66708
[5,] 1.964103e+00 1.971179e-01         19.50528  0.727917816           77.89133
[6,] 4.847042e-01 5.712682e-02         19.62780 -0.590412942           48.00552
        tvalue-X1
[1,]  0.007677763
[2,] -0.578100695
[3,] -0.928433176
[4,]  2.359608079
[5,]  1.401464648
[6,] -0.696207004
> 
> 
> summary(lm(scale(y2)[,1] ~ x2$X1))

Call:
lm(formula = scale(y2)[, 1] ~ x2$X1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9689 -0.4293 -0.1755  0.7809  1.4725 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0008938  0.3550383   0.003    0.998
x2$X1       0.0056539  0.7363975   0.008    0.994

Residual standard error: 1.061 on 8 degrees of freedom
Multiple R-squared:  7.368e-06, Adjusted R-squared:  -0.125 
F-statistic: 5.895e-05 on 1 and 8 DF,  p-value: 0.9941

> 
> 
> anatLm(~X1, x2, scale(y2))
N: 10 P: 2
Beginning vertex loop: 10 3
Done with vertex loop
anatModel, showing  6  of  10  rows, and  6  of  7 columns
print more by supplying n and width to the print function
      F-statistic    R-squared beta-(Intercept)      beta-X1 tvalue-(Intercept)
[1,] 5.894804e-05 7.368451e-06     0.0008938254  0.005653886        0.002517546
[2,] 3.342004e-01 4.009988e-02    -0.0659380526 -0.417090641       -0.189559824
[3,] 8.619882e-01 9.726803e-02    -0.1026950861 -0.649596972       -0.304434213
[4,] 5.567750e+00 4.103665e-01     0.2109359235  1.334273552        0.773717965
[5,] 1.964103e+00 1.971179e-01     0.1461933556  0.924744940        0.459541728
[6,] 4.847042e-01 5.712682e-02    -0.0787017937 -0.497827587       -0.228287007
        tvalue-X1
[1,]  0.007677763
[2,] -0.578100695
[3,] -0.928433176
[4,]  2.359608079
[5,]  1.401464648
[6,] -0.696207004

cfhammill commented 3 years ago

Did my example not work for you Jason? I can see a few differences, in your first example the scale is in the model formula, this may behave differently than a scaled column in a data.frame and in the second example the anatomy is scaled which I don't know if I would have expected to have the same bug

gdevenyi commented 3 years ago

I'm not sure I know exactly where you say the bug is @cfhammill

Doesn't this code at the beginning:

> x <- rnorm(10)
> y <- rnorm(10)
> data <- data.frame(x = scale(rnorm(10)))

Mean the numbers in data are not the same as those in the object x, since you called rnorm again?

Or maybe I'm not seeing something?

cfhammill commented 3 years ago

oh good point! Thanks Gabe. I'll try to cook up a better example that properly illustrates the bug, this happened in real data.

cfhammill commented 3 years ago

Here we go a working example that illustrates the bug

data <- data.frame(y = rnorm(10))
data$x <- scale(rnorm(10))

lmod <- lm(y ~ x, data = data)
almod <- anatLm(~ x, data = data, anat = cbind(brain = data$y, behav = rnorm(10)))

lmod$coefficients["x"]
almod["brain", "beta-x"]

I get

> lmod$coefficients["x"]
        x 
0.1293817 
> almod["brain", "beta-x"]
[1] 0.02652394

note it is not possible to do data.frame(y = rnorm(10), x = scale(rnorm(10))) the data.frame call is smart enough to drop the attributes.

gdevenyi commented 3 years ago

Thanks @cfhammill. When you say attributes, do you mean there's something special about the column in the data objects not shown in the example above that is causing the issue? What does that look like? What happens if you strip it back out, does it work properly?

jasonlerch commented 3 years ago

I see it now. Something weird with attributes or classes, as this fixes it:

data <- data.frame(y = rnorm(10))
data$x <- scale(rnorm(10))[,1]

lmod <- lm(y ~ x, data = data)
almod <- anatLm(~ x, data = data, anat = cbind(brain = data$y, behav = rnorm(10)))

lmod$coefficients["x"]
almod["brain", "beta-x"]

cfhammill commented 3 years ago

> x <- rnorm(10)
> df <- data.frame(x, sx1 = scale(x))
> df$sx2 <- scale(df$x)
> attributes(df$sx1)
NULL
> attributes(df$sx2)
$dim
[1] 10  1

$`scaled:center`
[1] -0.03587015

$`scaled:scale`
[1] 1.283665

When you scale a vector variable it apparently turns it into a matrix (didn't know that) and then adds scaled:center and scaled:scale attributes (did know that one). If you pass the scaled variable to data.frame the attributes get dropped meaning it stops being a matrix and doesn't have those attrs. I guess the dim attributes (what makes it a matrix) is where the problem is coming from. *Lm functions handle a matrix on the lhs a special way I think.

cfhammill commented 3 years ago

@jasonlerch your work around probably drops all attributes as well. I'm reasonably confident the issue is with it being a matrix.

cfhammill commented 3 years ago

So this bug is very interesting, it triggers for a length 1 RHS formula

y ~ x

the function logic handles this case specially, and I believe the issue is model matrix is left uninitialized in the if statement here, taking the value of matrix() set up here

You can trigger this bug with a larger model

y  ~ x + z

but not if all the RHS terms are scaled

data <- data.frame(y = rnorm(10), z = rnorm(10))
data$x <- scale(rnorm(10))
lmod <- lm(y ~ x + z, data = data)
almod <- anatLm(~ x + z, data = data, anat = cbind(brain = data$y, behav = rnorm(10)))

lmod$coefficients["x"]
almod["brain", "beta-x"]

recapitulates the bug, but

data <- data.frame(y = rnorm(10))
data$x <- scale(rnorm(10))
data$z <- scale(rnorm(10))

lmod <- lm(y ~ x + z, data = data)
almod <- anatLm(~ x + z, data = data, anat = cbind(brain = data$y, behav = rnorm(10)))

does not. Because the internal code fails in an interesting way when all the terms turn out to be matrices, never producing incorrect out, but failing instead.

cfhammill commented 3 years ago

So in vertex_lm_loop if there is a matrix in the RHS of the formula that matrix is expected to have the same shape as anat, so it's just wandering off into uninitialized memory I'm pretty sure. https://github.com/Mouse-Imaging-Centre/RMINC/blob/master/src/vertex_modelling.c#L121

cfhammill commented 3 years ago

TLDR: anatLm needs better error checking, I think we should probably drop the two group comparison functionality @jasonlerch what do you think? vertex_lm_loop could be cleaned up significantly if we didn't have this feature.

cfhammill commented 3 years ago

I suspect this bug does not influence vertex or mincLm thankfully.

cfhammill commented 3 years ago

I've fixed the bug in develop by blocking all the code paths that use the matrix RHS feature. I'll remove the now dead code when I get a chance. Hoping for a Monday release.

Mouse-Imaging-Centre / RMINC

[Critical] Bug in anatLm with `scale`d variables #294