Regulon Specificity Scores Are Correlated With Size

ggruenhagen3 commented 6 months ago

Bug Description

When using regulon_specificity_scores on my auc_matrix, I find that the values are strongly correlated with the size of my groups of cells. This can be seen in the image below, where the samples (rows) are sorted from largest to smallest. The largest samples (sample_01-03; 5599-8254 cells) have the largest scores and the smallest samples (sample_14-16; 57-314 cells) have the smallest scores. I computed the correlation of each regulon's scores with sample size: the mean correlation was 0.967 and the minimum was 0.751.
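For reference, a check like the one described above could be sketched as follows. This is only an illustration on toy data, not the exact code I ran: the helper name correlation_with_group_size and the toy values are assumptions, and it assumes the scores table has one row per sample and one column per regulon.

```python
import pandas as pd


def correlation_with_group_size(rss: pd.DataFrame, groups: pd.Series) -> pd.Series:
    """Pearson correlation of each regulon's scores with the number of cells per group.

    rss    : groups (rows) x regulons (columns), e.g. a table of specificity scores
    groups : one group label per cell (hypothetical stand-in for adata.obs['sample'])
    """
    # Count cells per group and put the counts in the same row order as rss.
    sizes = groups.value_counts().reindex(rss.index).astype(float)
    # Correlate every regulon column with the vector of group sizes.
    return rss.corrwith(sizes)


# Toy data (made up): three groups with 5, 3 and 1 cells.
groups = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 1)
rss_toy = pd.DataFrame(
    {"R1": [0.5, 0.3, 0.1], "R2": [0.1, 0.3, 0.5]},
    index=["a", "b", "c"],
)

corr = correlation_with_group_size(rss_toy, groups)
print(corr)         # per-regulon correlation with group size
print(corr.mean())  # summary across regulons
print(corr.min())
```

In the toy table, R1 rises with group size and R2 falls with it, so the two regulons land at opposite ends of the correlation range.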

Steps to reproduce the behavior

I am using regulon_specificity_scores to compare regulons across my samples (i.e. within regulon and across samples). The relevant code is below.

from pyscenic.rss import regulon_specificity_scores
rss = regulon_specificity_scores(auc, adata.obs['sample'])

Brief Description of Previous Steps

For more context, here is a brief description of the steps taken before this. I used the CLI, with my raw counts matrix (22526 genes x 43456 cells) as input and without batch correction. I had already run pyscenic grn, pyscenic ctx, and pyscenic aucell. I executed this on an HPC running CentOS Linux 7, using pyscenic 0.12.1 installed via pip.

Questions

Is this expected behavior? -> It is undesirable, to me at least, because I want to compare regulons across samples.
If this is expected, do you think it would be appropriate for me to normalize the regulon_specificity_scores by sample size (i.e. divide the scores by the number of cells in the sample) in order to make the values comparable?
If normalizing by sample size is not a good idea, are there other ways to compare regulons across samples that I'm missing?

ggruenhagen3 commented 6 months ago

I found that taking the per-sample mean of the AUC matrix (i.e. from lf.ca.RegulonsAUC) produced results that made sense biologically and matched our expectations. This strategy seems to work well, but I wanted to check whether it is a reasonable one.
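In case it helps anyone, the per-sample averaging can be sketched roughly like this. It is a minimal sketch on toy data, assuming the AUC matrix is already a pandas DataFrame with cells as rows and regulons as columns; the helper name mean_auc_by_sample and the toy values are made up for illustration.

```python
import pandas as pd


def mean_auc_by_sample(auc: pd.DataFrame, samples: pd.Series) -> pd.DataFrame:
    """Average each regulon's AUC over the cells of every sample.

    auc     : cells (rows) x regulons (columns)
    samples : sample label per cell, indexed by the same cell IDs as auc
    """
    # Align the labels to the AUC rows so each cell gets its own label.
    samples = samples.reindex(auc.index)
    # Group cells by sample and take the mean of every regulon column.
    return auc.groupby(samples).mean()


# Toy data (made up): four cells, two regulons, two samples.
auc_toy = pd.DataFrame(
    {"R1": [0.1, 0.3, 0.2, 0.4], "R2": [0.5, 0.5, 0.0, 0.2]},
    index=["c1", "c2", "c3", "c4"],
)
samples_toy = pd.Series(["s1", "s1", "s2", "s2"], index=["c1", "c2", "c3", "c4"])

mean_auc = mean_auc_by_sample(auc_toy, samples_toy)
print(mean_auc)  # samples (rows) x regulons (columns)
```

Because each sample's value is an average over its own cells, the result does not scale with the number of cells in the sample, which is the property I was after when comparing a regulon across samples.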

ggruenhagen3 commented 6 months ago

It seems like others have used this strategy and the dev team didn't raise any warnings (https://github.com/aertslab/pySCENIC/issues/75#issue-444833301). I will assume this is reasonable, unless the dev team states otherwise.

aertslab / pySCENIC

Regulon Specificity Scores Are Correlated With Size #542