anatLm doesn't allow for NAs in depdendent variable

Mouse-Imaging-Centre / RMINC

Statistics for MINC volumes: A library to integrate voxel-based statistics for MINC volumes into the R environment. Supports getting and writing of MINC volumes, running voxel-wise linear models, correlations, etc.; correcting for multiple comparisons using the False Discovery Rate, and more. With contributions from Jason Lerch, Chris Hammill, Jim Nikelski and Matthijs van Eede. Some additional information can be found here:

https://mouse-imaging-centre.github.io/RMINC

Other

22 stars 17 forks source link

anatLm doesn't allow for NAs in depdendent variable #280

Open gdevenyi opened 4 years ago

gdevenyi commented 4 years ago

We're working on a project with varying pass-fail for segmentations in MAGeT based on volume. We'd like to selectively include subjects based on their QC. Normally we would just include/exclude whole subjects, but we want maximal numbers for all the volumes.

In most R functions, NAs in a given variable trigger that row being dropped from the analysis. We were hoping to take advantage of that here in anatLm, by masking the failed QC values with NA, so we can use the remaining data for the subject if other labels passed QC.

However, instead, anatLm fails and returns NA for all outputs:

anatModel, showing  2  of  2  rows, and  6  of  7 columns
print more by supplying n and width to the print function
     F-statistic R-squared beta-(Intercept) beta-DXBig tvalue-(Intercept) tvalue-DXBig
vol           NA        NA               NA         NA                 NA           NA
vol2    185.7611 0.0444005          4989.06   -176.596           544.5404    -13.62942

vol had one entry replaced with an NA.

dnmacdon commented 4 years ago

This can be generated by the following code:

library(RMINC)

cont  <- data.frame(vol = rnorm(10000, mean=10000, sd=800), vol2=rnorm(10000, mean=5000, sd=400), IV=as.factor("Control"))
treat <- data.frame(vol = rnorm(10000, mean=10600, sd=800), vol2=rnorm(10000, mean=4800, sd=400), IV=as.factor("Treatment"))
all<-rbind(cont,treat)

all$vol[1] <- NA
mod<-anatLm(~ IV, data=all, anat=all[1:2])
print(mod)

The behaviour is the same regardless of the setting of na.action:

options(na.action = na.fail)

options(na.action = na.omit)

options(na.action = na.exclude)

options(na.action = NULL)

cfhammill commented 4 years ago

We should probably match the behaviour of lm in matrix mode. I'm not sure what it does in this case but we'll do that. anatLm relies on custom c code where na.action is never considered.

gdevenyi commented 4 years ago

The model.frame function here: https://svn.r-project.org/R/trunk/src/library/stats/src/model.c handles the application of na checking before the solver is invoked. Applies the passed through na.action (na.fail, na.omit, etc).

I was reading the RMINC code last night and I couldn't quite achieve a way to drop the NA data yet.

cfhammill commented 4 years ago

I don't know how the lm code handles missing values in a multivariate response variable though. I will try to investigate soon. Dropping missing data inconsistently across rows strikes me as wrong, the degrees of freedom would be different for each row. I suspect anatFDR and potentially effect size calculations would be wrong if we did this.

gdevenyi commented 4 years ago

@cfhammill ah, I never conceived of anatLm as being a multivariate response, but rather a convenience-like apply function that understands a bit more metadata than rolling my own, but considering how vertexLm and mincLm work, I guess its closer to multivariate response.

So, here's now lm handles na.omit and na.exclude: https://stats.stackexchange.com/questions/11000/how-does-r-handle-missing-values-in-lm

cfhammill commented 4 years ago

Most of our code assumes a fixed model matrix for each row, as in the multivariate response case , changing this would be a laborious undertaking. It's a bit of a surprise but you can pass a matrix of response variables to lm, and it's super quick. If it wasn't for IO buffering and some clever space savings the RMINC *Lm functions wouldn't be necessary.

Y <- matrix(rnorm(30), ncol = 3)
x <- rnorm(10)

lm(Y ~ x)


Call:
lm(formula = Y ~ x)

Coefficients:
             [,1]     [,2]     [,3]
(Intercept)   0.1374   0.2728   0.2908
x             0.4342  -0.3819   0.3286

A quick experiment

Y[1,1] <- NA
Y[2,2] <- NA

m <- lm(Y ~ x)

nrow(m$model)

[1] 8

shows that R drops whole rows, not independently for each column. I think that your use case is definitely valid though, so I will need to think about how to add it without breaking too many things.

gdevenyi commented 4 years ago

If #282 is addressed, we can work around it with hackery.

cfhammill commented 4 years ago

I don't think hackery is needed. Just use

RMINC:::matrixApply(anat, function(vols){ m <- lm(vols ~ y, data = d); c(coef(m), df(m), ...) }, )

You'll probably need to work on the processing function a bit, but this will do the trick. This treats each voxel as independent, and you get the default configurable NA behaviour .

cfhammill commented 4 years ago

I should really expose matrix apply.