GabrielHoffman / dreamlet

Perform differential expression analysis on multi-sample single cell datasets using linear mixed models
https://gabrielhoffman.github.io/dreamlet
18 stars 5 forks source link

Different use of random / fixed effects throughout dreamlet processing in example scripts? #22

Open berniejmulvey opened 1 month ago

berniejmulvey commented 1 month ago

Hi,

I've been using your dreamlet package for a number of DE analyses of high-dimension transcriptomic data and came across something that's got me scratching my head in the provided full-analysis scripts examining published data. In many cases, the models used in processAssays() and subsequently in dreamlet() are not the same.

A straightforward example comes from https://github.com/GabrielHoffman/dreamlet_analysis/blob/main/COVID_combat/COVID_combat_major.Rmd , wherein sex and Source are random variables in the processAssays() call but subsequently converted to fixed effects. In other examples, variables used in processAssays are simply dropped from the formula passed to dreamlet:

form = ~ Age + **(1|sex) + (1|Source)**
res.proc = processAssays(pb, form,  
                          min.count = 3,
                          min.cells = 3,
                          BPPARAM = SnowParam(6))

res.vp = fitVarPart(res.proc, form, BPPARAM=SnowParam(6) )

form = ~ Source + **Age + sex**
fit = dreamlet(res.proc, form, BPPARAM=SnowParam(6))

It's somewhat unclear at the level of the code what is prompting these changes in formula between steps, and I can't find much clarifying why one would want to use different variables in the two steps in either the variancePartition or the dream/dreamlet docs. Some pointers would be much appreciated!

Best regards, and great work with this package! -Bernie Mulvey

GabrielHoffman commented 1 month ago

Thanks for brining this up. There are 3 general applications for a formula in this context, and you mentioned all 3. It is a balance statistical performance and computational time. Modeling a variable as a random effect vs a fixed effect matters statistically when the number of levels in the variable is large compared to the sample size, something on the order of $\sqrt{n}$. When the number of levels is small like for sex, there is really no practical difference in the results. But using a random effect is much more computationally intensive.

So for processAssays() I could have just as easily modeled sex and Source as fixed effects. For variance partitioning with fitVarPart(), estimates of variance fractions are more accurate when categorical variables are modeled as random effects. So I used the same formula for both analyses.

I should have used the same formula for dreamlet() for consistency, but there is minimal practical difference in results.

Gabriel