YosefLab / Compass

In-Silico Modeling of Metabolic Heterogeneity using Single-Cell Transcriptomes
BSD 3-Clause "New" or "Revised" License

Pseudo-bulk or mixed-effects modeling approach to differential testing of reaction scores #103

Closed mariecrane closed 8 months ago

mariecrane commented 9 months ago

Thank you for developing this great tool! I am wondering whether you have considered other approaches to testing for differential reaction scores between conditions or cell types of interest. Your currently suggested approach is a Wilcoxon rank-sum test, but it does not account for pseudo-replication, i.e., multiple cells drawn from a single biological sample being treated as independent observations. When there are multiple samples per experimental condition, this biases the test toward overconfident p-values, since cells from the same sample are not statistically independent.

The two common strategies for addressing pseudo-replication in differential expression analysis of single-cell RNA-seq data are pseudo-bulk aggregation and mixed-effects modeling. What do you think of applying these strategies to Compass results? I know Compass theoretically works on bulk RNA-seq data, so I've considered aggregating (summing) my counts across cells within each sample/cell type to generate pseudo-bulk data and running Compass on that. An added benefit of this approach is that it would drastically reduce the computing time, since Compass would only have to calculate penalties for a handful of pseudo-bulk samples instead of thousands of cells. However, it would also lose the granularity provided by single-cell measurements. I have also considered modeling the per-cell reaction scores with a mixed-effects model that includes the donor/sample as a random effect, which is another way to account for pseudo-replication.
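To make the aggregation step concrete, here is a minimal sketch of summing raw counts across cells within each sample/cell type. The function name `pseudobulk`, the dictionary layout, and the toy gene counts and donor labels are all assumptions made for illustration; none of this is part of Compass itself.

```python
from collections import defaultdict


def pseudobulk(counts, cell_meta):
    """Sum raw counts over all cells that share a (sample, cell_type) pair.

    counts:    {cell_id: {gene: raw_count}}
    cell_meta: {cell_id: (sample, cell_type)}
    returns:   {(sample, cell_type): {gene: summed_count}}
    """
    bulk = defaultdict(lambda: defaultdict(int))
    for cell_id, gene_counts in counts.items():
        group = cell_meta[cell_id]
        for gene, n in gene_counts.items():
            bulk[group][gene] += n
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {group: dict(genes) for group, genes in bulk.items()}


# Hypothetical toy data: four cells, two donors, two cell types.
counts = {
    "c1": {"HK1": 3, "LDHA": 10},
    "c2": {"HK1": 1, "LDHA": 4},
    "c3": {"HK1": 0, "LDHA": 7},
    "c4": {"HK1": 5, "LDHA": 2},
}
cell_meta = {
    "c1": ("donor1", "T cell"),
    "c2": ("donor1", "T cell"),
    "c3": ("donor2", "T cell"),
    "c4": ("donor2", "B cell"),
}

bulk = pseudobulk(counts, cell_meta)
for group, genes in sorted(bulk.items()):
    print(group, genes)
```

Each resulting (sample, cell type) profile would then stand in for one "cell" in the expression matrix given to Compass, after whatever normalization the Compass documentation asks for.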

What do you think of these approaches? I would greatly value your thoughts or concerns about their validity in this context.

allonw commented 8 months ago

Hi Marie,

Thank you for the thoughtful comment! You are actually spot on: this is how we have recently been using the software ourselves and how we recommend others use it. To recap what you wrote: compute pseudobulks at a granularity that matches the desired analysis (e.g., aggregate counts for cells of a certain type within a certain sample), and then follow with a fixed- or mixed-effects model. This has statistical benefits (it avoids pseudo-replication, i.e., treating cells as independent samples) and also cuts down the runtime.
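As a rough illustration of the downstream step, the simplest fixed-effects model is `score ~ condition`, fit by ordinary least squares with one row per pseudobulk, so that the sample rather than the cell is the unit of replication. The sketch below assumes a two-level condition and one pseudobulk per donor; the function name `condition_effect` and the scores are hypothetical, and this is not a procedure shipped with Compass.

```python
from math import sqrt
from statistics import mean


def condition_effect(scores, conditions, reference):
    """OLS fit of score ~ condition for a two-level condition.

    scores:     one reaction score per pseudobulk (not per cell)
    conditions: condition label of each pseudobulk
    reference:  label treated as the baseline level
    returns:    (effect, t_statistic, degrees_of_freedom)
    """
    ref = [s for s, c in zip(scores, conditions) if c == reference]
    alt = [s for s, c in zip(scores, conditions) if c != reference]
    # With a single binary covariate, the OLS slope is the difference in means.
    effect = mean(alt) - mean(ref)
    # Residual sum of squares around each group's fitted mean.
    ssr = sum((s - mean(ref)) ** 2 for s in ref)
    ssr += sum((s - mean(alt)) ** 2 for s in alt)
    df = len(ref) + len(alt) - 2
    se = sqrt(ssr / df * (1 / len(ref) + 1 / len(alt)))
    return effect, effect / se, df


# Hypothetical scores for one reaction: three donors per condition.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
conditions = ["ctrl", "ctrl", "ctrl", "treated", "treated", "treated"]

effect, t_stat, df = condition_effect(scores, conditions, reference="ctrl")
print(f"effect={effect:.3f}, t={t_stat:.3f}, df={df}")
```

The p-value would come from a t distribution with `df` degrees of freedom. If each donor contributes several pseudobulks (for example, one per cell type), the rows are again not independent, and a mixed-effects model with donor as a random effect is the more appropriate choice.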

I reserve the right to refer to your comment next time someone asks me a related question :)