RGLab / MAST

Tools and methods for analysis of single cell assay data in R
224 stars 57 forks source link

Parallelization for summary() with doLRT; Passing a design matrix to zlm() #169

Closed ImNotaGit closed 2 years ago

ImNotaGit commented 2 years ago

I would like to report two issues, or rather, make two feature requests.

First one. I am trying to perform differential expression analysis on single-cell RNA-seq datasets using the latest Bioconductor release version (v1.20.0), following the tutorial here. To enable parallel computation with multiple cores, I use the parallel=TRUE in zlm, i.e. zlm_fit <- zlm(formula, sca, parallel=TRUE). This seems to work fine. Then I try to do the LR test with summary(zlm_fit, doLRT='something'), but here I do not see how I can enable parallelization. It seems that any additional arguments passed to summary is ignored. Given that the LR test involves refitting the null model and is at least as time consuming as the fitting the full model, I think it should make sense to also allow the user to enable parallelization here.

The second one is that I would like to pass a design matrix (i.e. what one would get with model.matrix) to zlm, instead of an R formula, as for certain complex designs using R formula does not give what I need the design to be. I suppose that currently this could be done via passing the LMlike argument. However, in the documentation it is written that "This is intended for internal use", and it's not clear how an user can create an LMlike object. It seems that there is a slot called modelMatrix within the internally generated LMlike object that stores the design matrix. It would be great if the zlm function can simply accept another design argument to pass a customized design matrix, and when it's provided, the formula argument will be ignored.

Thanks.

amcdavid commented 2 years ago

Hi @ImNotaGit, you can try installing the latest development version from github or bioconductor regarding your question about parallelization.

For the design matrix, you can just expand it manually and write it colData(sca), then write out a formula naming each of the columns in the design (+0):

data("vbetaFA")
mm = model.matrix(~ Population/Time, data = colData(vbetaFA))
colnames(mm) = paste0('X', seq_len(ncol(mm)))
colData(vbetaFA) = cbind(colData(vbetaFA), mm)
zlm(~ 0 + X1 + X2 + X3 + X4 + X5, vbetaFA)

The actual formula is used in various places downstream of zlm, so I am suspecting it would not be such an easy tweak to allow the model matrix to be specified directly.