Closed gdagstn closed 1 year ago
Hello! thanks for reporting this, and for the reproducible example.
Personally, I am not convinced it makes a lot of sense to use such large gene sets for single-cell data. One of your signatures (Ductal cells) contains more than 1500 genes, another one is also larger than 1000 genes. This is in the order of the # of detected genes in a typical single-cell dataset; in fact in your pancreas example there are many cells with very low counts:
library(scuttle)
metrics <- perCellQCMetrics(panc)
hist(metrics$detected, breaks=50)
This is also where the maxRank
parameter comes into play. UCell scores are based on gene expression rankings; because typically only a few thousand genes are detected per cell, the rankings are only calculated on the top maxRank
genes (default 1500). In this way, the tail of zeros that constitute the majority of the data are not accounted in the ranking, and you will gain sensitivity to capture high- vs. low-expression of genes. If you set a maxRank
much larger than the number of actually detected genes, then you have a situation where you "rank" genes with zero gene expression - you can see how this can give unstable results. As far as I know AUCell uses a similar strategy; the fact that its results are not affected by the maximum rank makes me think that they automatically override this parameter when it is higher than the number of detected genes, which is probably not a bad idea anyways. We would have to check the code to be sure.
In summary, my suggestion would be to: 1) use smaller gene sets. If these come from bulk RNA-seq DE, perhaps one can focus on more highly significant genes. For example, I took 200 random genes for each signature of your MURARO gene sets, and re-run the analysis; the results are a lot more stable to the value of maxRank
(which does not mean they are correct). 2) Set a maxRank
parameter that reflects the depth of the dataset, so that the majority of cells have # detected genes > maxRank. I would not typically use maxRank>2000
unless you have very good sequencing depth and want to be able to detect genes with very low expression.
I hope this helps, -massimo
Hi Massimo,
thanks for the insightful reply.
I see what you mean and I agree with you that it would be better to take smaller genesets; the issue with using signatures from msigdb is that they are given "as is" and do not have any sort of internal ranking (I do not know a priori which ones are the highly significant). So a random downsampling may not be always good and even less robust. One can always go back to the relevant paper and find ways to downsize these signatures but then I guess it becomes more cumbersome.
Thanks again!
Hi,
Thanks for developing this package! It's very fast and easy to use.
I was comparing some methods for automatic cell annotation and I noticed that UCell was giving me odd results. I narrowed it down to the
maxRank
parameter, which I think makes sense given how it is used to normalize the U statistic and calculate scores.I attach a simple reprex.
In this example I am using the Segerstolpe et al. 2016 pancreas single cell dataset (obtained through the Bioconductor
scRNAseq
package), which I want to annotate using marker/genesets from the Muraro et al. 2016 pancreas single cell dataset (obtained through the Bioconductormsigdbr
package). My simple idea is to use the maximum score value per dataset per cell to annotate each cell to a particular type.Conversely, if we use
AUCell
, the results look more stable:What I find potentially worrying is that using a value of
maxRank
close to the default (1600, since one of the genesets has >1500 genes) I would have missed out on several other populations in the dataset.So my question is: is there a particular way or rationale to choose a value for
maxRank
without having to try different values?In a small and well-characterized dataset such as this one this is not a big issue, but I was wondering about applying
UCell
to larger, more complex datasets.Alternatively, is
UCell
designed to use smaller genesets? Do large (800/1000) genesets throw it out because of the denominator?Thanks in advance and sorry for the lengthy post.