carmonalab / UCell

Gene set scoring for single-cell data
GNU General Public License v3.0

How to tune UCell for being less sensitive to number of captured genes #39

Closed zvittorio closed 2 weeks ago

zvittorio commented 1 month ago

Hi UCell people!

First of all, thanks for the amazing resource. UCell is super flexible and scalable, features other tools lack. I have applied UCell to score a bunch of gene sets that vary a lot in size. My dataset combines the same cell type across several published cohorts. I have noticed that I get systematically lower UCell scores in cells that have low UMI counts and a low number of expressed genes, and vice versa. This is the distribution of the number of expressed genes in my dataset: image

Most of the cells have somewhere around 2000 expressed genes, but there is a very long tail towards high values. Is the result expected, given this distribution (few expressed genes = low UCell scores, and vice versa)? Given that doublets were already removed, and that UCell scores the gene sets on each cell individually, I think that what could make a difference is tuning maxRank (also based on #25). Is this correct? If so, in what direction should I change it: higher or lower than the default? Are there any other parameters or "tricks" to make UCell less sensitive to the number of expressed genes?

Thanks for the help!

Vittorio

mass-a commented 1 month ago

Hello Vittorio, thanks for the kind words!

In principle, there shouldn't be a design reason for UCell scores to be systematically lower in cells with lower UMI counts. UCell scores are based on ranks, and downsampling a distribution (i.e. reduced sequencing depth) should not affect the relative ranking of the genes in terms of expression. That holds only if the downsampling is uniform across genes, which it may not be. Do you think that cells with lower UMI counts tend to be enriched/depleted in certain classes of genes, e.g. mitochondrial, ribosomal or other?

By the way, did you quantify the correlation between UMI counts and UCell scores? Do you see it across datasets, or also within individual samples? I would be interested to see these results if you have them, and continue the discussion.

Best -massimo

zvittorio commented 1 month ago

Hi Massimo

Thanks for the prompt reply. I haven't noticed a particular enrichment or depletion of any class of genes, and I have checked those to which I have easy access, like mitochondrial genes. So I would exclude that. I quickly checked the correlation between the UCell score of each gene signature and the number of expressed genes (nFeature_RNA in Seurat objects). This is the distribution of Spearman's rho values across all signatures:

image
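
For readers who want to reproduce this kind of check outside of R, here is a minimal sketch of a per-signature Spearman correlation between scores and the number of detected genes. It is plain Python with made-up toy values; the names `n_feature`, `scores_by_signature`, `sig_A` and `sig_B` are hypothetical and not taken from the dataset discussed here.

```python
def average_ranks(values):
    """Return 1-based ascending ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: one entry per cell.
n_feature = [800, 1500, 2100, 2600, 4000, 6500]
scores_by_signature = {
    "sig_A": [0.02, 0.05, 0.09, 0.11, 0.20, 0.31],  # rises with n_feature
    "sig_B": [0.12, 0.08, 0.15, 0.07, 0.11, 0.10],  # no clear trend
}

# One rho per signature; the distribution of these values is what gets plotted.
rho_by_signature = {
    name: spearman_rho(n_feature, scores)
    for name, scores in scores_by_signature.items()
}
```

Running the same computation separately per study, rather than on the pooled cells, helps distinguish a within-sample effect from a between-study one.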

Thanks to your suggestion, I have noticed that one study contributes most of the higher values of this distribution. It is the only one where rho exceeds 0.7 (coincidentally the biggest study, accounting for half of the cells), while the others don't go further than 0.6. My dataset includes single cells and single nuclei, and what I noticed is that nFeature is lower in single nuclei studies, but also in one of the single cell studies. Notably, the studies where the correlations are the lowest are single nuclei studies. But I can't directly see how this discrepancy can affect UCell scores, since they are computed on each cell individually. Another observation I'd like to report is that very big signatures tend to be less affected by nFeature. I am not sure how to interpret this, but it probably argues for an increase in maxRank?

So I performed a very quick and dirty parameter sweep for maxRank, to find the setting that gives the lowest correlations. The values I tested are 500, 1500 (default), 3000, and 5000, the latter corresponding to all genes I have at this stage of the analysis. The lowest correlation values on average are obtained with maxRank = 5000, but the distribution is almost centered at 0 and therefore also contains negative correlation values, see below (my apologies for the different layout)

image

Thanks for your time in discussing this!

Vittorio

mass-a commented 1 month ago

Hi Vittorio, thanks for sharing. Indeed the correlation between # of UMIs and UCell scores looks quite strong.

I wonder if that has to do with the increasing sparsity of the data as the number of counts decreases (= fewer detected genes). With fewer detected genes, the probability that a UCell score is exactly zero increases, whereas in cells with more detected genes at least some of the genes in a given signature will have some counts. When one sets a higher maxRank, the effect is dampened because more genes of a signature, even those with zero counts, are included in the ranking (at the bottom of the ranking, but the score is not exactly zero).
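
To make the maxRank argument concrete, here is a minimal sketch of a UCell-like score for a single cell. It is a simplified illustration and not the package's code. It assumes that tied values (including all zero-count genes) share their average rank, that ranks above `max_rank` are set to `max_rank + 1`, and that the score is 0 when every signature gene falls beyond `max_rank`. The function names and the toy expression vector are hypothetical.

```python
def descending_average_ranks(expr):
    """Rank genes within one cell: 1 = highest value, ties share their mean rank."""
    order = sorted(range(len(expr)), key=lambda i: -expr[i])
    ranks = [0.0] * len(expr)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and expr[order[end + 1]] == expr[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def ucell_like_score(expr, signature, max_rank):
    """Mann-Whitney-U-based score in [0, 1] for one cell and one signature.

    `signature` is a list of gene indices into `expr`.
    """
    ranks = descending_average_ranks(expr)
    sig_ranks = [ranks[g] for g in signature]
    # Assumption: no signature gene within max_rank gives a score of exactly 0.
    if all(r > max_rank for r in sig_ranks):
        return 0.0
    n = len(sig_ranks)
    capped = [max_rank + 1 if r > max_rank else r for r in sig_ranks]
    u = sum(capped) - n * (n + 1) / 2
    return 1 - u / (n * max_rank)


# Toy cell with 10 genes, only 4 detected; the six zeros tie at rank 7.5.
cell = [9, 7, 5, 3, 0, 0, 0, 0, 0, 0]
undetected_signature = [4, 5]  # both signature genes have zero counts

low_cap = ucell_like_score(cell, undetected_signature, max_rank=4)    # 0.0
high_cap = ucell_like_score(cell, undetected_signature, max_rank=10)  # 0.4
```

With the small cap, the tied zero-count genes sit beyond `max_rank` and the score collapses to zero. With the cap covering all genes, the same signature gets a non-zero score, which is the dampening described above.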

An experiment I would suggest, to test whether the effect is technical or biological, is to start with one or more samples having large enough gene counts, and then generate alternative versions with reduced sequencing depth. This can be simulated e.g. using the 'downsampleMatrix' function from the scuttle package. How do UCell scores behave for the full depth vs. low depth versions of the same datasets?
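
The idea behind this experiment can be sketched in a few lines of plain Python. The snippet below uses binomial thinning (each UMI is kept independently with probability `prop`) as a simple stand-in for reduced sequencing depth. It is not scuttle's implementation, and the toy count vector and the names `thin_counts` and `n_detected` are hypothetical.

```python
import random


def thin_counts(counts, prop, rng):
    """Keep each UMI independently with probability `prop` (binomial thinning)."""
    return [sum(1 for _ in range(c) if rng.random() < prop) for c in counts]


def n_detected(counts):
    """Number of genes with at least one count (analogous to nFeature)."""
    return sum(1 for c in counts if c > 0)


rng = random.Random(0)  # fixed seed so the simulation is reproducible

# Hypothetical cell: a few highly expressed genes and many lowly expressed ones.
full_depth = [50, 30, 20, 10, 5, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0]
low_depth = thin_counts(full_depth, prop=0.2, rng=rng)

print("detected genes, full depth:", n_detected(full_depth))
print("detected genes, low depth: ", n_detected(low_depth))
```

Lowly expressed genes are the first to drop to zero under thinning, so the low-depth version of a cell has fewer detected genes. Scoring both versions of the same cells with the same signatures then shows how much of the score difference is purely technical.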

zvittorio commented 2 weeks ago

Hi Massimo

I am only replying now since I've been on vacation for a couple of weeks. Indeed, my reasoning for the increase in maxRank was exactly as you say, with more genes included in the ranking even if they have zero counts.

In addition I am trying the experiment you suggested with 'downsampleMatrix' from scuttle. However, I believe that this problem is due to my particular set of datasets where the gene coverage varies greatly. If anything interesting pops up down the line, I will report it here by reopening the issue. For now, thank you so much for your contribution, it was really helpful!

Kind regards Vittorio