Closed cfhammill closed 3 years ago
I can't replicate this - I get RMINC's anatLm and summary(lm) matching, both with and without scaling.
> set.seed(1234)
> y2 <- rnorm(100, mean = 20)
> dim(y2) <- c(10,10)
> x2 <- rnorm(100)
> dim(x2) <- c(10,10)
> x2 <- data.frame(x2)
>
> summary(lm(y2[,1] ~ x2$X1))
Call:
lm(formula = y2[, 1] ~ x2$X1)
Residuals:
Min 1Q Median 3Q Max
-1.9606 -0.4275 -0.1747 0.7776 1.4663
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.61773 0.35354 55.489 1.23e-11 ***
x2$X1 0.00563 0.73330 0.008 0.994
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.056 on 8 degrees of freedom
Multiple R-squared: 7.368e-06, Adjusted R-squared: -0.125
F-statistic: 5.895e-05 on 1 and 8 DF, p-value: 0.9941
>
> anatLm(~X1, x2, y2)
N: 10 P: 2
Beginning vertex loop: 10 3
Done with vertex loop
anatModel, showing 6 of 10 rows, and 6 of 7 columns
print more by supplying n and width to the print function
F-statistic R-squared beta-(Intercept) beta-X1 tvalue-(Intercept)
[1,] 5.894804e-05 7.368451e-06 19.61773 0.005630069 55.48900
[2,] 3.342004e-01 4.009988e-02 19.81145 -0.445176521 53.36109
[3,] 8.619882e-01 9.726803e-02 19.54366 -0.432632409 86.99105
[4,] 5.567750e+00 4.103665e-01 19.42244 1.193168580 79.66708
[5,] 1.964103e+00 1.971179e-01 19.50528 0.727917816 77.89133
[6,] 4.847042e-01 5.712682e-02 19.62780 -0.590412942 48.00552
tvalue-X1
[1,] 0.007677763
[2,] -0.578100695
[3,] -0.928433176
[4,] 2.359608079
[5,] 1.401464648
[6,] -0.696207004
>
>
> summary(lm(scale(y2)[,1] ~ x2$X1))
Call:
lm(formula = scale(y2)[, 1] ~ x2$X1)
Residuals:
Min 1Q Median 3Q Max
-1.9689 -0.4293 -0.1755 0.7809 1.4725
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0008938 0.3550383 0.003 0.998
x2$X1 0.0056539 0.7363975 0.008 0.994
Residual standard error: 1.061 on 8 degrees of freedom
Multiple R-squared: 7.368e-06, Adjusted R-squared: -0.125
F-statistic: 5.895e-05 on 1 and 8 DF, p-value: 0.9941
>
>
> anatLm(~X1, x2, scale(y2))
N: 10 P: 2
Beginning vertex loop: 10 3
Done with vertex loop
anatModel, showing 6 of 10 rows, and 6 of 7 columns
print more by supplying n and width to the print function
F-statistic R-squared beta-(Intercept) beta-X1 tvalue-(Intercept)
[1,] 5.894804e-05 7.368451e-06 0.0008938254 0.005653886 0.002517546
[2,] 3.342004e-01 4.009988e-02 -0.0659380526 -0.417090641 -0.189559824
[3,] 8.619882e-01 9.726803e-02 -0.1026950861 -0.649596972 -0.304434213
[4,] 5.567750e+00 4.103665e-01 0.2109359235 1.334273552 0.773717965
[5,] 1.964103e+00 1.971179e-01 0.1461933556 0.924744940 0.459541728
[6,] 4.847042e-01 5.712682e-02 -0.0787017937 -0.497827587 -0.228287007
tvalue-X1
[1,] 0.007677763
[2,] -0.578100695
[3,] -0.928433176
[4,] 2.359608079
[5,] 1.401464648
[6,] -0.696207004
Did my example not work for you Jason? I can see a few differences, in your first example the scale
is in the model formula, this may behave differently than a scale
d column in a data.frame
and in the second example the anatomy is scaled which I don't know if I would have expected to have the same bug
I'm not sure I know exactly where you say the bug is @cfhammill
Doesn't this code at the beginning:
> x <- rnorm(10)
> y <- rnorm(10)
> data <- data.frame(x = scale(rnorm(10)))
Mean the numbers in data
are not the same as those in the object x
, since you called rnorm
again?
Or maybe I'm not seeing something?
oh good point! Thanks Gabe. I'll try to cook up a better example that properly illustrates the bug, this happened in real data.
Here we go a working example that illustrates the bug
data <- data.frame(y = rnorm(10))
data$x <- scale(rnorm(10))
lmod <- lm(y ~ x, data = data)
almod <- anatLm(~ x, data = data, anat = cbind(brain = data$y, behav = rnorm(10)))
lmod$coefficients["x"]
almod["brain", "beta-x"]
I get
> lmod$coefficients["x"]
x
0.1293817
> almod["brain", "beta-x"]
[1] 0.02652394
note it is not possible to do data.frame(y = rnorm(10), x = scale(rnorm(10)))
the data.frame
call is smart enough to drop the attributes.
Thanks @cfhammill. When you say attributes, do you mean there's something special about the column in the data
objects not shown in the example above that is causing the issue? What does that look like? What happens if you strip it back out, does it work properly?
I see it now. Something weird with attributes or classes, as this fixes it:
data <- data.frame(y = rnorm(10))
data$x <- scale(rnorm(10))[,1]
lmod <- lm(y ~ x, data = data)
almod <- anatLm(~ x, data = data, anat = cbind(brain = data$y, behav = rnorm(10)))
lmod$coefficients["x"]
almod["brain", "beta-x"]
> x <- rnorm(10)
> df <- data.frame(x, sx1 = scale(x))
> df$sx2 <- scale(df$x)
> attributes(df$sx1)
NULL
> attributes(df$sx2)
$dim
[1] 10 1
$`scaled:center`
[1] -0.03587015
$`scaled:scale`
[1] 1.283665
When you scale a vector variable it apparently turns it into a matrix (didn't know that) and then adds scaled:center
and scaled:scale
attributes (did know that one). If you pass the scaled variable to data.frame
the attributes get dropped meaning it stops being a matrix and doesn't have those attrs. I guess the dim
attributes (what makes it a matrix) is where the problem is coming from. *Lm
functions handle a matrix on the lhs a special way I think.
@jasonlerch your work around probably drops all attributes as well. I'm reasonably confident the issue is with it being a matrix.
So this bug is very interesting, it triggers for a length 1 RHS formula
y ~ x
the function logic handles this case specially, and I believe the issue is model matrix is left uninitialized in the if statement here, taking the value of matrix()
set up here
You can trigger this bug with a larger model
y ~ x + z
but not if all the RHS terms are scale
d
data <- data.frame(y = rnorm(10), z = rnorm(10))
data$x <- scale(rnorm(10))
lmod <- lm(y ~ x + z, data = data)
almod <- anatLm(~ x + z, data = data, anat = cbind(brain = data$y, behav = rnorm(10)))
lmod$coefficients["x"]
almod["brain", "beta-x"]
recapitulates the bug, but
data <- data.frame(y = rnorm(10))
data$x <- scale(rnorm(10))
data$z <- scale(rnorm(10))
lmod <- lm(y ~ x + z, data = data)
almod <- anatLm(~ x + z, data = data, anat = cbind(brain = data$y, behav = rnorm(10)))
does not. Because the internal code fails in an interesting way when all the terms turn out to be matrices, never producing incorrect out, but failing instead.
So in vertex_lm_loop
if there is a matrix in the RHS of the formula that matrix is expected to have the same shape as anat
, so it's just wandering off into uninitialized memory I'm pretty sure. https://github.com/Mouse-Imaging-Centre/RMINC/blob/master/src/vertex_modelling.c#L121
TLDR: anatLm
needs better error checking, I think we should probably drop the two group comparison functionality @jasonlerch what do you think? vertex_lm_loop
could be cleaned up significantly if we didn't have this feature.
I suspect this bug does not influence vertex or mincLm thankfully.
I've fixed the bug in develop by blocking all the code paths that use the matrix RHS feature. I'll remove the now dead code when I get a chance. Hoping for a Monday release.
If a data frame includes a column that is
scale
d variable with the concomitant attributes it fails dramatically, producing nonsensical t-statistics.NOTE: This example does not demonstrate the bug, but the bug is real, scroll down in the thread for a working example
note the difference between
Estimate
x fromlm
andbrain
beta-x
fromanatLm
same for the t-statsIt is likely this bug affects
vertexLm
andmincLm
as well