Sensitivity to maxRank parameter

gdagstn commented 1 year ago

Hi,

Thanks for developing this package! It's very fast and easy to use.

I was comparing some methods for automatic cell annotation and I noticed that UCell was giving me odd results. I narrowed it down to the maxRank parameter, which I think makes sense given how it is used to normalize the U statistic and calculate scores.

I attach a simple reprex.

In this example I am using the Segerstolpe et al. 2016 pancreas single cell dataset (obtained through the Bioconductor scRNAseq package), which I want to annotate using marker/genesets from the Muraro et al. 2016 pancreas single cell dataset (obtained through the Bioconductor msigdbr package). My simple idea is to use the maximum score value per dataset per cell to annotate each cell to a particular type.

library(msigdbr)
library(clustree)
library(UCell)
library(AUCell)
library(scRNAseq)

# Retrieve genesets
type_genes = msigdbr("Homo sapiens", category = "C8")
genesets = lapply(split(type_genes, type_genes$gs_name), function(x) x$gene_symbol)
muraro_genes = genesets[grep("MURARO", names(genesets))]

# Increase readability
names(muraro_genes) = gsub(names(muraro_genes), pattern = "MURARO_PANCREAS_", replacement = "")

# Retrieve dataset
panc = SegerstolpePancreasData()

# Loop through several values of maxRank
maxranks = c(1600, 2000, 2500, 3000, 3500, 4000)

ucell_scorelist = lapply(maxranks, function(mr) {
  scores = ScoreSignatures_UCell(matrix = counts(panc), features = muraro_genes, maxRank = mr)
  assigned = names(muraro_genes)[apply(scores, 1, which.max)]
  return(assigned)
})

## a few genes are not found... 

# Visualize relationships
names(ucell_scorelist) = paste0("MR_", maxranks)
ucell_scoremat = do.call(cbind, ucell_scorelist)
clustree(ucell_scoremat, prefix = "MR_", node_text_size = 1.5, node_text_angle = 15)

Conversely, if we use AUCell, the results look more stable:

rankings <- AUCell_buildRankings(counts(panc),
                                 plotStats = FALSE,
                                 verbose = FALSE)

auc_scorelist = lapply(maxranks, function(mr) {

  aucs <- AUCell_calcAUC(muraro_genes,
                         rankings,
                         aucMaxRank = mr)

  scores <- as.data.frame(t(assay(aucs)))
  assigned <- names(muraro_genes)[apply(scores, 1, which.max)]
  return(assigned)
})

names(auc_scorelist) = paste0("MR_", maxranks)
auc_scoremat = do.call(cbind, auc_scorelist)
clustree(auc_scoremat, prefix = "MR_", node_text_size = 2)

What I find potentially worrying is that using a value of maxRank close to the default (1600, since one of the genesets has >1500 genes) I would have missed out on several other populations in the dataset.

So my question is: is there a particular way or rationale to choose a value for maxRank without having to try different values?

In a small and well-characterized dataset such as this one this is not a big issue, but I was wondering about applying UCell to larger, more complex datasets.

Alternatively, is UCell designed to use smaller genesets? Do large (800/1000) genesets throw it out because of the denominator?

Thanks in advance and sorry for the lengthy post.

mass-a commented 1 year ago

Hello! thanks for reporting this, and for the reproducible example.

Personally, I am not convinced it makes a lot of sense to use such large gene sets for single-cell data. One of your signatures (Ductal cells) contains more than 1500 genes, another one is also larger than 1000 genes. This is in the order of the # of detected genes in a typical single-cell dataset; in fact in your pancreas example there are many cells with very low counts:

library(scuttle)
metrics <- perCellQCMetrics(panc)
hist(metrics$detected, breaks=50)

This is also where the maxRank parameter comes into play. UCell scores are based on gene expression rankings; because typically only a few thousand genes are detected per cell, the rankings are only calculated on the top maxRank genes (default 1500). In this way, the tail of zeros that constitute the majority of the data are not accounted in the ranking, and you will gain sensitivity to capture high- vs. low-expression of genes. If you set a maxRank much larger than the number of actually detected genes, then you have a situation where you "rank" genes with zero gene expression - you can see how this can give unstable results. As far as I know AUCell uses a similar strategy; the fact that its results are not affected by the maximum rank makes me think that they automatically override this parameter when it is higher than the number of detected genes, which is probably not a bad idea anyways. We would have to check the code to be sure.

In summary, my suggestion would be to: 1) use smaller gene sets. If these come from bulk RNA-seq DE, perhaps one can focus on more highly significant genes. For example, I took 200 random genes for each signature of your MURARO gene sets, and re-run the analysis; the results are a lot more stable to the value of maxRank (which does not mean they are correct). 2) Set a maxRank parameter that reflects the depth of the dataset, so that the majority of cells have # detected genes > maxRank. I would not typically use maxRank>2000 unless you have very good sequencing depth and want to be able to detect genes with very low expression.

I hope this helps, -massimo

gdagstn commented 1 year ago

Hi Massimo,

thanks for the insightful reply.

I see what you mean and I agree with you that it would be better to take smaller genesets; the issue with using signatures from msigdb is that they are given "as is" and do not have any sort of internal ranking (I do not know a priori which ones are the highly significant). So a random downsampling may not be always good and even less robust. One can always go back to the relevant paper and find ways to downsize these signatures but then I guess it becomes more cumbersome.

Thanks again!

carmonalab / UCell

Sensitivity to maxRank parameter #25