Open gdevenyi opened 4 years ago
This can be generated by the following code:
library(RMINC)
cont <- data.frame(vol = rnorm(10000, mean=10000, sd=800), vol2=rnorm(10000, mean=5000, sd=400), IV=as.factor("Control"))
treat <- data.frame(vol = rnorm(10000, mean=10600, sd=800), vol2=rnorm(10000, mean=4800, sd=400), IV=as.factor("Treatment"))
all<-rbind(cont,treat)
all$vol[1] <- NA
mod<-anatLm(~ IV, data=all, anat=all[1:2])
print(mod)
The behaviour is the same regardless of the setting of na.action:
options(na.action = na.fail)
options(na.action = na.omit)
options(na.action = na.exclude)
options(na.action = NULL)
We should probably match the behaviour of lm in matrix mode. I'm not sure what it does in this case, but we'll do that. anatLm relies on custom C code where na.action is never considered.
The model.frame function here: https://svn.r-project.org/R/trunk/src/library/stats/src/model.c handles NA checking before the solver is invoked. It applies whichever na.action is passed through (na.fail, na.omit, etc.).
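As a rough illustration of that step in plain R (no RMINC involved), the sketch below calls model.frame directly with different na.action values. The data frame `d` and its columns are made up for the example.

```r
# Toy data: one missing response value (names are illustrative only)
d <- data.frame(vol = c(10, NA, 12, 11), IV = factor(c("a", "a", "b", "b")))

# na.omit: the incomplete row is removed before any fitting happens
mf_omit <- model.frame(vol ~ IV, data = d, na.action = na.omit)
n_omit <- nrow(mf_omit)

# na.fail: model.frame stops with an error instead of dropping rows
fail_result <- tryCatch(
  model.frame(vol ~ IV, data = d, na.action = na.fail),
  error = function(e) "error"
)

# na.pass: the NA is kept and handed on to whatever runs next
mf_pass <- model.frame(vol ~ IV, data = d, na.action = na.pass)
n_pass <- nrow(mf_pass)
```

Because the filtering happens while the model frame is built, a fitting routine that never builds one has no opportunity to honour na.action.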
I was reading the RMINC code last night and couldn't quite find a way to drop the NA data yet.
I don't know how the lm code handles missing values in a multivariate response variable though. I will try to investigate soon. Dropping missing data inconsistently across rows strikes me as wrong: the degrees of freedom would be different for each row. I suspect anatFDR and potentially effect size calculations would be wrong if we did this.
@cfhammill ah, I never thought of anatLm as fitting a multivariate response, but rather as a convenience apply function that understands a bit more metadata than rolling my own. Considering how vertexLm and mincLm work, though, I guess it's closer to a multivariate response.
So, here's how lm handles na.omit and na.exclude: https://stats.stackexchange.com/questions/11000/how-does-r-handle-missing-values-in-lm
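For readers who don't want to follow the link, here is a small sketch of the difference using simulated data (the variables are invented for the example). Both settings fit the same model; they differ in what the extractor functions return.

```r
set.seed(1)
x <- rnorm(10)
y <- 2 * x + rnorm(10)
y[3] <- NA   # one missing response

fit_omit <- lm(y ~ x, na.action = na.omit)
fit_excl <- lm(y ~ x, na.action = na.exclude)

# Same fit either way: coefficients and residual df are identical
same_coef <- all.equal(coef(fit_omit), coef(fit_excl))
df_omit <- df.residual(fit_omit)   # 9 observations - 2 parameters = 7

# The difference shows up in the extractors:
# na.omit returns 9 residuals, na.exclude pads back to 10 with an NA
len_omit <- length(residuals(fit_omit))
len_excl <- length(residuals(fit_excl))
pad_is_na <- is.na(residuals(fit_excl)[3])
```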
Most of our code assumes a fixed model matrix for each row, as in the multivariate response case, so changing this would be a laborious undertaking. It's a bit of a surprise, but you can pass a matrix of response variables to lm, and it's super quick. If it wasn't for IO buffering and some clever space savings, the RMINC *Lm functions wouldn't be necessary.
Y <- matrix(rnorm(30), ncol = 3)
x <- rnorm(10)
lm(Y ~ x)
Call:
lm(formula = Y ~ x)
Coefficients:
[,1] [,2] [,3]
(Intercept) 0.1374 0.2728 0.2908
x 0.4342 -0.3819 0.3286
A quick experiment
Y[1,1] <- NA
Y[2,2] <- NA
m <- lm(Y ~ x)
nrow(m$model)
[1] 8
shows that R drops whole rows, not independently for each column. I think that your use case is definitely valid though, so I will need to think about how to add it without breaking too many things.
If #282 is addressed, we can work around it with hackery.
I don't think hackery is needed. Just use
RMINC:::matrixApply(anat, function(vols){ m <- lm(vols ~ y, data = d); c(coef(m), df.residual(m), ...) }, )
You'll probably need to work on the processing function a bit, but this will do the trick. This treats each voxel as independent, and you get the default configurable NA behaviour.
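A minimal base-R sketch of the same idea follows. It uses plain apply rather than RMINC:::matrixApply, whose exact signature isn't shown in this thread; `fit_column`, `anat` and `d` are hypothetical names with toy data.

```r
set.seed(42)
n <- 20
d <- data.frame(IV = factor(rep(c("Control", "Treatment"), each = n / 2)))
# Two structure volumes per subject, as columns (toy values)
anat <- cbind(vol = rnorm(n, 10000, 800), vol2 = rnorm(n, 5000, 400))
anat[1, "vol"] <- NA   # one failed-QC value in the first structure only

# Hypothetical helper: fit one structure and return a flat numeric vector
fit_column <- function(vols) {
  m <- lm(vols ~ IV, data = d, na.action = na.omit)
  c(coef(m), df = df.residual(m))
}

# One column of results per structure; each fit drops only its own NAs
res <- apply(anat, 2, fit_column)
```

Note that the `df` row differs between the two structures (17 versus 18 here), which is exactly the per-row degrees-of-freedom variation raised earlier in the thread, so anything downstream that assumes a single df would need to account for it.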
I should really expose matrix apply.
We're working on a project with varying pass-fail for segmentations in MAGeT based on volume. We'd like to selectively include subjects based on their QC. Normally we would just include/exclude whole subjects, but we want maximal numbers for all the volumes.
In most R functions, NAs in a given variable trigger that row being dropped from the analysis. We were hoping to take advantage of that here in anatLm, by masking the failed QC values with NA, so we can use the remaining data for the subject if other labels passed QC.
Instead, anatLm fails and returns NA for all outputs:
vol had one entry replaced with an NA.
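The QC masking described above could look something like the sketch below. The `qc` table and all values are invented for illustration; the real QC records may be stored quite differently.

```r
# Toy volumes: 4 subjects x 2 labels (values are made up)
vols <- data.frame(vol  = c(10100, 9800, 10400, 9900),
                   vol2 = c(5100, 4900, 5000, 4700))
# Hypothetical QC table of the same shape: TRUE = segmentation passed
qc <- data.frame(vol  = c(FALSE, TRUE, TRUE, TRUE),
                 vol2 = c(TRUE, TRUE, FALSE, TRUE))

masked <- vols
masked[!as.matrix(qc)] <- NA   # blank out only the failed label/subject pairs

# Subjects available for each label when NAs are handled per label
usable_per_label <- colSums(!is.na(masked))
# Subjects left if any failed label removes the whole subject
complete_subjects <- sum(complete.cases(masked))
```

In this toy case each label keeps three subjects under per-label handling, while whole-subject exclusion leaves only two.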