ajaynadig / TRADE


Question about DESeq2 preparation #3

Closed. deto closed this issue 1 month ago.

deto commented 1 month ago

I'm very interested in trying out this method!

I was wondering: when preparing the data with DESeq2, what preprocessing steps did you typically use? I'm curious about the normalization method, as I know this can be finicky with single-cell data, and also about anything else like gene filtering. Would it be possible to share a snippet here?

ajaynadig commented 1 month ago

Hi--thank you for your interest in the method!

I think that the answer to this question will necessarily vary between datasets, but I'm happy to provide some details on what we did in our paper.

First off, I will note that there are actually no DESeq2 models run at the single-cell level in our paper--all are pseudobulk, aggregating within perturbation/batch combinations. We decided on pseudobulk after running the model at both the single-cell and pseudobulk level for a subset of perturbations and finding nearly identical results (for example, the attached scatterplots compare transcriptome-wide impact estimates). For those single-cell analyses, I had used scran::computeSumFactors for normalization, but honestly have not done a deep dive on those analyses in a long time, as we made the decision to use pseudobulk relatively early on.

[image: comparison of single-cell and pseudobulk transcriptome-wide impact estimates]
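To make the pseudobulk step concrete, here is a minimal sketch of aggregating raw counts within perturbation/batch combinations using base R. The matrix and label names are illustrative, not taken from the TRADE codebase:

```r
# Toy genes x cells count matrix plus per-cell labels (illustrative names)
set.seed(1)
counts <- matrix(rpois(6 * 8, lambda = 5), nrow = 6,
                 dimnames = list(paste0("gene", 1:6), paste0("cell", 1:8)))
perturbation <- rep(c("control", "gene_x_KD"), each = 4)
batch <- rep(c("b1", "b2"), times = 4)

# Sum raw counts across cells within each perturbation/batch combination;
# rowsum() groups the rows of t(counts) (i.e., cells) by the combined label
group <- paste(perturbation, batch, sep = "_")
pseudobulk <- t(rowsum(t(counts), group = group))  # genes x pseudobulk samples
```

The resulting genes x samples matrix of summed raw counts is then suitable as input to DESeqDataSetFromMatrix.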

For the main Weissman lab Perturb Seq datasets, we used a slightly modified version of the default DESeq2 median-of-ratios normalization approach. In particular, to account for the fact that some perturbations had very few cells, we set minmu to be lower than the default, borrowing from the recommended DESeq2 single cell settings:

```r
dds_pseudobulk <- DESeq(
  dds_pseudobulk,
  test = "LRT",
  minmu = 1e-6,
  minReplicatesForReplace = Inf,
  reduced = ~ batch + 1,
  betaPrior = FALSE,
  parallel = TRUE
)
```

(This command automatically runs DESeq2's median-of-ratios normalization, then fits a negative binomial GLM and performs significance testing with a likelihood-ratio test.)

Prior to this, we QC'd cells by applying filters on total nUMIs and percent mitochondrial reads, following the approach of Replogle et al. 2022, who visually examined a scatterplot of these two metrics and manually chose cutoffs (noting that such cutoffs may differ, for example, if average cell sizes differ between cell lines). Attached is an example of the cutoffs for Jurkat and HepG2, which I chose in consultation with the lead author of Replogle et al. 2022.

[image: nUMI vs. percent mitochondrial read cutoffs for Jurkat and HepG2]

The data are then filtered with a command like:

```r
obj <- obj[, obj$mitopercent < 0.14 & obj$UMI_count > 1750]
```

(Note that these cutoffs have already been applied in the publicly available h5ads of these data if you are interested in using those)

We also subset genes to those with a mean nUMI across cells of 0.01 or greater, also following Replogle et al. 2022. This choice is somewhat more debatable, but we reasoned that below a certain level of expression, estimated log2FCs would be so noisy and unstable that they would not provide much information to estimate the distribution of differential expression effects.
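The gene filter can be sketched in one line of base R; the toy matrix and row names here are illustrative:

```r
# Toy genes x cells matrix of raw counts (illustrative)
counts <- rbind(
  well_expressed  = c(2, 0, 1, 3),
  lowly_expressed = c(0, 0, 0, 0)
)

# Keep genes with mean UMI count across cells >= 0.01, per the threshold above
keep <- rowMeans(counts) >= 0.01
counts_filtered <- counts[keep, , drop = FALSE]
```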

One note re: normalization is that normalization issues may be captured by the TRADE-estimated mean of the effect size distribution. For example, if normalization fails and there is a difference in library size between two conditions, this would result in a nonzero mean log2FC. After running TRADE, you can look at this value to assess whether there is a very obvious issue with normalization. If you want to be a bit more rigorous about this, you can look at the section of the wiki about estimating uncertainty to assess significance of the mean term. Along these lines, I will note that, following the DESeq2 single-cell docs, I had originally used scran::computeSumFactors for normalization, but found that this led to a nonzero mean in null simulations, which was resolved after switching to median-of-ratios (perhaps suggesting, unsurprisingly, that pseudobulk data behave more like bulk data with respect to normalization).
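As a quick sanity check along these lines, you can compute the mean log2FC across genes from a DESeq2-style results table before running TRADE; a mean far from zero (relative to its uncertainty) hints at a normalization problem such as a residual library-size difference. The `res` data frame below is a toy stand-in, not real output:

```r
# Toy DESeq2-style results table (illustrative values, including an NA)
res <- data.frame(log2FoldChange = c(0.05, -0.02, 0.01, NA, 0.03))

# Mean log2FC across genes as a crude normalization diagnostic
mean_lfc <- mean(res$log2FoldChange, na.rm = TRUE)
```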

Hope that helps! Please let me know if any additional details would be helpful, or if you have any other comments/critiques!

deto commented 1 month ago

Thanks for the very detailed response! I had missed that you were pseudobulking when reading the manuscript. Helpful to know your thought process as well. I will let you know if any other questions come up (let me know if there is a better venue for that other than GitHub issues).

deto commented 1 month ago

Ah, one other question: I was going through this example and noticed that you had used test = "LRT" when defining the DESeq2 object. Would Wald also work here? With LRT, it looks like you can only get one p-value for each gene (comparing, for example, ~ donor + perturb vs. ~ donor). Or would you suggest using LRT but splitting up the dataset and creating separate DESeq2 objects for each perturbation? (Apologies if these are basic DESeq2 questions; I was always an edgeR user, so I'm just learning their workflow.)

ajaynadig commented 1 month ago

Wald should work fine here as well! TRADE doesn't actually use the p-values themselves to fit the core mixture model; rather, it uses them to subset "significant DE" genes, estimate the proportion of signal due to significant DE genes, etc. A slight change in the set of DE genes due to a different testing approach shouldn't make a huge difference.

More generally, I think that it would probably be wise to split the dataset into separate DESeq objects for each perturbation, and this is the approach we have taken in our paper. My rationale for this is that DESeq2 does not fit a covariate-adjusted model for overdispersion. In a typical DESeq2 analysis with one perturbation, you are making the assumption that overdispersion does not vary between control and perturbed cells, but with a mega-analysis with all perturbations, you are making the stronger assumption that the overdispersion is shared within genes across all perturbations. Additionally, if I understand your model correctly, each perturbation would have its own fixed binary term in the model--for many perturbations, this can become very costly, as the compute time for DESeq2 scales supra-linearly with the number of covariates. That being said, a counterargument is that sharing some information between perturbations may be a good thing. This is probably a worthy empirical experiment to see how the various trade-offs play out.
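One way to organize the per-perturbation split: each analysis contains that perturbation's samples plus all controls, and each index set seeds its own DESeq2 object. The labels below are illustrative, and the DESeq2 call is shown only as a comment:

```r
# Per-sample perturbation labels (illustrative)
perturbation <- c("control", "control", "gene_a", "gene_a", "gene_b")

# For each non-control perturbation, collect its samples plus all controls
targets <- setdiff(unique(perturbation), "control")
cells_by_perturbation <- setNames(
  lapply(targets, function(p) which(perturbation %in% c(p, "control"))),
  targets
)

# Each index set would then seed its own DESeq2 object, e.g. (not run):
# idx <- cells_by_perturbation[["gene_a"]]
# dds <- DESeqDataSetFromMatrix(counts[, idx], colData[idx, ],
#                               design = ~ batch + perturbed)
# dds <- DESeq(dds, test = "LRT", reduced = ~ batch, minmu = 1e-6)
```

This keeps each model's dispersion estimates specific to one perturbation-vs-control comparison, matching the rationale above.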

Lastly, I'll mention that there is no reason in principle that you couldn't run TRADE on edgeR summary statistics, with two caveats:

  1. edgeR does not natively output SEs, because the creators are skeptical of SEs from count-based GLMs: https://support.bioconductor.org/p/61640/

     Therefore, to do this, you may need to figure out a way to compute SEs yourself from the available model output. For our analyses, we convinced ourselves that DESeq2 SEs were well-calibrated using a mixture of simulations and comparisons with non-parametric methods (e.g., the bootstrap).

  2. The typical edgeR workflow includes empirical Bayes shrinkage of effect sizes. In general, you do not want to use shrunken effect sizes as input for TRADE, so this step should be skipped.
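Regarding the first caveat, one rough workaround (not an edgeR-supported API) is to back out an approximate SE from an unshrunk logFC and its p-value, under the assumption that the p-value behaves like a two-sided Wald test with z = logFC / SE. Since edgeR's default tests are likelihood-ratio or quasi-likelihood, this is only an approximation and should be validated with simulations before trusting it:

```r
# Approximate SE implied by an unshrunk logFC and a two-sided, Wald-like
# p-value (assumed; edgeR's LR/QL p-values make this only approximate)
approx_se <- function(logFC, pvalue) {
  z <- qnorm(pvalue / 2, lower.tail = FALSE)  # |z| implied by the p-value
  abs(logFC) / z
}
```

For example, approx_se(1, 2 * pnorm(-2)) recovers an SE of 0.5, since that p-value corresponds to |z| = 2.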

Hope that helps! Let me know if any other Qs.

deto commented 1 month ago

Yes, I was thinking of doing it as you said, with a term for each perturbation (but shared terms for other covariates, e.g. 'donor'). I didn't realize that about DESeq2 and its dispersion estimation; edgeR uses the covariate model when doing this, and I was assuming DESeq2 did similarly. I did notice that run-times were already very long when trying this out with a few dozen perturbations, so I will experiment with splitting up the object. Thanks for the recommendation!

And good to know your thoughts on what is required for using edgeR for this in the future!

Thanks again!