HelenaLC / muscat

Multi-sample multi-group scRNA-seq analysis tools
166 stars 33 forks source link

Suggestion for how to approach time the experimental variable with muscat? #34

Closed dpcook closed 4 years ago

dpcook commented 4 years ago

Hi there. Thanks for the work on this package--it's fantastic!

Apologies for a naive question. I'm wondering how you would suggest approaching the analysis with >2 ages/timepoints as the experimental variable. In other settings, I may do something like lm(exp ~ time) or lm(exp ~ splines::ns(time, df=3)) and pull coefficients/p-values out of the model summary as needed.

I haven't worked much with limma/edgeR, so I'm not very familiar how this would be set up. From edgeR's documentation, I think the model.matrix() would work just like lm/gam/glm functions, but I'm not sure how the contrast matrix would get set up.

Any suggestions would be much appreciated!

David

plger commented 4 years ago

Hi, I'm not sure but I doubt ns and such will work in the context of limma/edgeR...

In case it's useful, here's what I normally do: When I analyse timeseries with no prior expectation as to the effects over time, I consider the timepoint as a categorical variable (there's also a paper benchmarking this specifically for RNAseq somewhere, from the Ciaudo lab if I remember correctly, which reaches similar conclusions). Then, rather than working with contrasts, I work directly with the model's coefficients. For example, suppose you have timepoints 0, 1 and 2, transformed to a factor (with 0 as base level), then doing:

mm <- model.matrix(~timepoint)

will create a model with an intercept and 2 coefficients (timepoint1 and timepoint2). You can tell edgeR which coefficient(s) to drop when testing, e.g. this tests whether something changes between timepoints 0 and 1:

res <- glmLRT( fit, coef="timepoint1" )

or this would test if something changes at any timepoint:

res <- glmLRT( fit, coef=c("timepoint1","timepoint2" )

Like edgeR, muscat also accepts specifying either the coef or contrast. There's probably a way of doing exactly the same with contrasts, I'm just not sure how on top of my head.

(Alternatively, if your timepoint variable is numeric (rather than categorical), your model will have a single time coefficient.)

Cheers, Pierre-Luc

dpcook commented 4 years ago

This was really helpful--thank you. Muscat's pbDS() didn't like when I used character coefs, but it worked when I used the appropriate column numbers from the model matrix (ie. c(2,3) here)

seurat.sce$group_id <- factor(seurat.sce$group_id, levels=c("P3", "P7", "P14"))
pb <- aggregateData(seurat.sce,
                    assay = "counts", fun = "sum",
                    by = c("cluster_id", "sample_id"))

mm <- model.matrix(~pb$group_id)
colnames(mm) <- levels(pb$group_id)
rownames(mm) <- colnames(pb)

res <- pbDS(pb, design = mm, coef = 2:3)

I notice that it computes separate logFC values for each time point, which is great, but it only reports a single F statistic and p-value, and writes a coef column that, in this specific case, says P7-P14 down the whole table. Does the p-value correspond to one of the two coefficients, or is it doing something like aggregating P7 and P14 data when fitting?

Because there may be transient changes at the middle time point, I think I'm looking for something like an LRT against a reduced model lacking timepoints.

I appreciate the help!

HelenaLC commented 4 years ago

For coef = 2:3, pbDS (glmQLFTest under the hood) test for any difference between the specified timepoints (H0: the 3 timepoints are identical vs. H1: there is a difference between them), thus, you obtain separate logFCs for each comparison but a single statistic and p-value. I am not sure what exactly you mean by "a reduced model lacking timepoints", but if you want to do a series of pairwise comparisons, pbDS needs a list of coefficients, e.g., coef = list(2, 3) would compare each of P7 and P14 against P3 (the reference factor level).

plger commented 4 years ago

"a reduced model lacking timepoints" is exactly what's happening with coef = 2:3 : what this tests is whether a model lacking the coefficients 2 and 3 explains the data significantly less than a model that has them (which translates into what Helena writes).

dpcook commented 4 years ago

Thanks both! Yes, that's perfect--just wanted to make sure it wasn't doing something weird under the hood when using coef = 2:3 . Thanks for clarifying. And the tip for providing a list to coef is great to know as well.

I appreciate the help!