carmonalab / UCell

Gene set scoring for single-cell data
GNU General Public License v3.0

Significant results with random gene signatures #41

Closed mattiabattistella closed 1 month ago

mattiabattistella commented 1 month ago

Dear developers,

I am currently using UCell version 2.6.2 in a possibly unconventional way. Specifically, I am scoring up- and down-regulated disease signatures in my single-cell dataset, which includes both control and disease cells. After running UCell scoring on the normalized counts, I split the cells into healthy and disease groups, plot the score distributions, and perform a Mann-Whitney test to assess whether the scores in disease cells are significantly higher (for the up-regulated signature) or lower (for the down-regulated signature) than in the control group. I am using this analysis as an exploratory tool, and my hypothesis appears to be supported by the results.

However, when I randomly draw 100 signatures of the same length as my up- or down-regulated signature from a pool of 30,000 genes, the difference between control and disease is non-significant in only 10% of cases. The remaining cases are significant, split roughly 50/50 between apparent up- and down-regulation. I expected the vast majority of random signatures to give non-significant results.

I wanted to ask whether this behavior matches what you would expect from UCell, or whether you have any insights on this approach.

Thank you for your time and consideration.

Best regards,

mass-a commented 1 month ago

Hello Mattia, thanks for the message.

I think this is a more general issue with statistics in single-cell omics: treating the cell as the biological replicate. Because it is now common to collect tens of thousands of cells in a single experiment, statistical tests often reach very high significance even for tiny or artifactual signals.

Do you have multiple samples in your dataset? What we tend to do in these situations is use individual samples, rather than cells, as the biological replicates. In your case you could, for example, calculate the average UCell score per sample for your population and gene set of interest, and then apply the statistical test to these per-sample averages. Does that make sense?
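
As a rough illustration of the per-sample approach described above, here is a minimal Python sketch (standard library only). It is not part of UCell, which is an R package, and it does not call UCell: the `(sample_id, condition, score)` input layout, the function names, and the simulated scores are all assumptions made for the example. The simulated data contain a sample-level shift but no condition effect, which is the kind of signal a per-cell test can mistake for a difference between groups.

```python
import random
from collections import defaultdict
from itertools import combinations
from statistics import mean


def sample_means(cells):
    """Average per-cell scores within each sample.

    `cells` is an iterable of (sample_id, condition, score) tuples
    (a hypothetical layout, e.g. exported from the scoring step).
    Returns {condition: [mean score of each sample]}.
    """
    by_sample = defaultdict(list)
    condition_of = {}
    for sample_id, condition, score in cells:
        by_sample[sample_id].append(score)
        condition_of[sample_id] = condition
    out = defaultdict(list)
    for sample_id, scores in by_sample.items():
        out[condition_of[sample_id]].append(mean(scores))
    return dict(out)


def u_statistic(x, y):
    """Mann-Whitney U of x versus y; ties count as 0.5."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)


def exact_mann_whitney(x, y):
    """Two-sided exact p-value by enumerating every group relabeling.

    Only practical for small groups, which is typical when the
    replicate is the sample rather than the cell.
    """
    pooled = list(x) + list(y)
    n, m = len(x), len(y)
    centre = n * m / 2
    observed = abs(u_statistic(x, y) - centre)
    hits = total = 0
    for idx in combinations(range(n + m), n):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(n + m) if i not in chosen]
        total += 1
        hits += abs(u_statistic(gx, gy) - centre) >= observed - 1e-12
    return hits / total


# Simulated scores: 4 control and 4 disease samples, 500 cells each.
rng = random.Random(0)
cells = []
for s in range(8):
    condition = "control" if s < 4 else "disease"
    offset = rng.gauss(0, 0.05)  # sample-level shift, no condition effect
    for _ in range(500):
        cells.append((f"S{s}", condition, 0.3 + offset + rng.gauss(0, 0.05)))

means = sample_means(cells)
p_value = exact_mann_whitney(means["control"], means["disease"])
print(f"per-sample Mann-Whitney p-value: {p_value:.3f}")
```

With 4 samples per group the smallest two-sided p-value this test can return is 2/70 (about 0.029), so the number of samples, not the number of cells, limits the achievable significance.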

mattiabattistella commented 1 month ago

Thanks for the fast reply and yes, it makes sense. Thanks!