RGLab / MAST

Tools and methods for analysis of single cell assay data in R
224 stars 57 forks source link

Nested batch effects in MAST #97

Open LuckyMD opened 6 years ago

LuckyMD commented 6 years ago

Hi!

I was wondering if you might be able to help me with an issue I've been having. I am trying to run MAST with a nested batch effect.

I have data from an experiment with 6 mice in 2 conditions (3 mice each), spread over 6 plates (condition & control spread half-half per plate). Now I'd like to run MAST on normalized, but not batch-corrected data and include the batch as a covariate.

First I ran ~condition + sample + chip + ngeneson, where sample is just the mouse. I got expected genes out of this, although down-regulated when I expected them to be up-regulated. Then, I noticed that sample is just a higher resolution of condition (each mouse is either condition or control). So that would lead to a non-full-rank design matrix and should have output wrong results I assume.

Running just ~condition + chip + ngeneson however gives me far fewer DE genes, and not the ones I would expect. I assume this has to do with the variability between mice making the background noise quite high.

I was wondering if I can add the mouse covariate so that it is fit separately per condition to account for variation between mice without interfering with the condition covariate.

Thanks for any help you can provide!

gfinak commented 6 years ago

I don't understand your design from your description. Can you clarify, please? Perhaps post your design matrix?

LuckyMD commented 6 years ago

Conceptually I set up my design matrix using the variables:

Condition Sample Chip ngeneson TRUE. A C1. 1030 FALSE. B C1 2025 FALSE. B. C2. 2583 TRUE. A. C2. 2353 TRUE. C. C3. 1843 FALSE. D. C3. 1283 ......

I assume this would have been transformed into 0s and 1s automatically and an offset would have been added. What I mean to say is that in this toy example Condition==TRUE includes samples A and C and Condition==FALSE includes samples B and D. Thus, if I include both condition and sample in the design matrix this should not be full rank as condition can be inferred from sample.

However, I did run this setup... and I recovered differentially expressed genes over the condition covariate. This puzzles me a little, as I would assume the variance in condition would have already been accounted for in the sample covariate.

Now, when I run without the sample covariate, I recover very different DEGs over condition.

Is there any way to account for sample variance within condition in the model?

gfinak commented 6 years ago

Maybe use a random effect model with method='glmer' but with only three samples and three chips you're not going to get great estimates of the variability. Something like: ~Condition+ngeneson+(1|Sample)+(1|Chip). And center your ngeneson variable. @amcdavid any thoughts? Also, as a warning there's issue #98 that's related to estimation with random effects that we need to look into still.

LuckyMD commented 6 years ago

I just used three in the example above. I actually have 8 chips and 8 samples split over 2 conditions.

Does centering the ngeneson variable make a big difference? That would be something I could alter.

I'm not very familiar with mixed effect models. What exactly would I be modeling using (1|Sample) + (1|Chip)?

Do you have any idea why I would get an output at all with the above design matrix (including sample and condition)? How can MAST fit a model where it could assign the variance to either the sample or condition covariate?

gfinak commented 6 years ago
LuckyMD commented 5 years ago

Hi again! I recently tried your suggested solution of using the following zlm command: zlmCond_astro <- zlm(formula = ~condition + (1|sample) + (1|chip) + ngeneson, method='glmer', ebayes=FALSE, sca=sca_astro_filt, parallel=FALSE)

However, I am running into an error regarding a fixef() function. Could you decipher the following error message for me?

Error in fixef(object@fitC) : could not find function "fixef"

gfinak commented 5 years ago

Maybe load the lme4 package first?

Greg Finak

On Wed, Dec 5, 2018, 09:52 MalteDLuecken <notifications@github.com wrote:

Hi again! I recently tried your suggested solution of using the following zlm command: zlmCond_astro <- zlm(formula = ~condition + (1|sample) + (1|chip) + ngeneson, method='glmer', ebayes=FALSE, sca=sca_astro_filt, parallel=FALSE)

However, I am running into an error regarding a fixef() function. Could you decipher the following error message for me?

Error in fixef(object@fitC) : could not find function "fixef"

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/RGLab/MAST/issues/97#issuecomment-444579297, or mute the thread https://github.com/notifications/unsubscribe-auth/ABUkeQmB4GTJ4nEpX6cQSeR68YQW_NCjks5u2AfAgaJpZM4VqjPH .

LuckyMD commented 5 years ago

Ah, thanks. I wasn't aware of other dependencies. I assumed I had accidentally renamed a fixed() function and was looking for my error ;). Mea culpa.

amcdavid commented 5 years ago

I think we have good reasons to have lme4 be only Suggested. Can we import its methods anyways? Putting a call to if(require(lme4)) seems like bad behavior since it modifies the search path (and I think is worthy of a warning from R CMD check

fbrundu commented 4 years ago

Hi all, I post here since the topic is the same and this issue is still open, but I can open a new issue if necessary. I would like to compare two conditions with nested batches, but with a simpler design matrix (I have no chip covariate). The composition of each condition is similar to:

condition  batch     
Case       Sample1   
Case       Sample2
Case       Sample3
Case       Sample4
Control    Sample5
Control    Sample6
Control    Sample7
Control    Sample8

In such case, with no chip covariate, does it make sense to also model the batch or would it lead to losing the condition signal, i.e.:

~ condition + ngeneson + batch

? At the moment I'm using MAST without the batch covariate

~ condition + ngeneson

since I thought introducing the batch covariate would lead to the wrong results in this case. Is this justified, or I should be able to use it instead?

Thanks

gfinak commented 4 years ago

I assume you have multiple cells per sample? If so I think you could include a random batch effect.

fbrundu commented 4 years ago

Thanks @gfinak, yes I have multiple cells per sample. I'll try with the batch covariate. I would like to learn more about why the batch covariate doesn't remove the condition. In a previous message, you indicate that "Sample and condition are not entirely aliased (you have more samples than conditions)." Do you have at hand some documentation I can read to understand better how the MAST's model works in this case? Thanks!

fbrundu commented 4 years ago

I tested using the batch covariate. I noticed that, although some genes related to the condition (positive controls) are recovered, most of the others are not. And the (log-fold-change, FDR) of the recovered genes is smaller (LFC) and less significant (FDR) with respect to the analysis without the batch covariate.

Is it correct to state that in this case, due to conditions partially confounded with batch (4 unique batches per condition), the statistical power of the test is reduced when adding the batch covariate? Keeping however in mind that adding the batch covariate is the more appropriate approach... Thanks.

noobCoding commented 3 years ago

Hi @fbrundu, I had the same problem when using MAST with the batch covariate. Has the problem with batch covariate been solved, somehow?