aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

Regulon Specificity Scores Are Correlated With Size #542

Closed by ggruenhagen3 6 months ago

ggruenhagen3 commented 6 months ago

Bug Description

When I run regulon_specificity_scores on my auc_matrix, the resulting values are strongly correlated with the size of my groups of cells. This can be seen in the image below, where the samples (rows) are sorted from largest to smallest. The largest samples (sample_01-03; 5599-8254 cells) have the largest scores and the smallest samples (sample_14-16; 57-314 cells) have the smallest scores. I computed the correlation of each regulon's scores with sample size: the mean correlation was 0.967 and the minimum was 0.751.

[image: regulon specificity scores, with samples as rows sorted from largest to smallest]
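A minimal sketch of how a per-regulon correlation with sample size like the one described above could be computed. It assumes the scores are in a pandas DataFrame with one row per sample and one column per regulon; the helper name `rss_size_correlation` and the toy values are hypothetical and only illustrate the calculation.

```python
import pandas as pd


def rss_size_correlation(rss: pd.DataFrame, labels: pd.Series) -> pd.Series:
    """Pearson correlation between each regulon's scores and group size.

    rss    : groups x regulons table of scores (assumed layout)
    labels : per-cell group label, e.g. adata.obs['sample']
    """
    # Number of cells per group, ordered like the rows of the score table
    sizes = labels.value_counts().reindex(rss.index).astype(float)
    # One coefficient per regulon (column)
    return rss.corrwith(sizes)


# Toy data, made up purely for illustration
labels = pd.Series(["a"] * 50 + ["b"] * 20 + ["c"] * 5)
rss = pd.DataFrame(
    {"R1(+)": [0.5, 0.3, 0.1], "R2(+)": [0.2, 0.4, 0.6]},
    index=["a", "b", "c"],
)

corr = rss_size_correlation(rss, labels)
print("mean:", corr.mean(), "min:", corr.min())
```

Summarizing with the mean and minimum of `corr` matches the two figures reported above.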

Steps to reproduce the behavior

I am using regulon_specificity_scores to compare regulons across my samples (i.e. within a regulon and across samples). The relevant code is below.

from pyscenic.rss import regulon_specificity_scores
rss = regulon_specificity_scores(auc, adata.obs['sample'])

Brief Description of Previous Steps

For more context, here is a brief description of the steps taken before this. I used the CLI and input my raw counts matrix (22526 genes x 43456 cells) without batch correction. I had already run pyscenic grn, pyscenic ctx, and pyscenic aucell. I executed this on an HPC running CentOS Linux 7, using pyscenic 0.12.1 installed via pip.

Questions

  1. Is this expected behavior? It is undesirable for me at least, because I want to compare regulons across samples.
  2. If this is expected, would it be appropriate to normalize the regulon_specificity_scores by sample size (i.e. divide the scores by the number of cells in the sample) to make the values comparable?
  3. If normalizing by sample size is not a good idea, are there other ways to compare regulons across samples that I'm missing?

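
One way to probe the first question is a small synthetic experiment. The sketch below assumes the score is defined as 1 minus the Jensen-Shannon distance between a regulon's normalized AUC values and the normalized group-membership indicator. That is my understanding of the approach, but it should be checked against the pySCENIC source; `rss_like_score` is a hypothetical stand-in, not the library function.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def rss_like_score(auc: np.ndarray, in_group: np.ndarray) -> float:
    """1 - JS distance between normalized AUC and group indicator (assumed definition)."""
    p = auc / auc.sum()
    q = in_group / in_group.sum()
    return 1.0 - jensenshannon(p, q)


n_cells = 10_000
# Identical activity in every cell, so no group is truly "specific"
auc = np.full(n_cells, 0.1)

group_sizes = [50, 500, 5000]
scores = []
for size in group_sizes:
    in_group = np.zeros(n_cells)
    in_group[:size] = 1.0  # the first `size` cells form the group
    scores.append(rss_like_score(auc, in_group))

for size, score in zip(group_sizes, scores):
    print(f"{size:>5} cells -> score {score:.3f}")
```

In this toy setting the score rises with group size even though the regulon activity is the same everywhere, which would be consistent with the correlation reported above. The dependence on size is also not a simple proportionality here, so dividing by the number of cells may not fully remove it.
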
ggruenhagen3 commented 6 months ago

I found that taking the mean of the AUC matrix (i.e. from lf.ca.RegulonsAUC) per sample produced results that made sense biologically and matched our expectations. This strategy seems to work well, but I wanted to check whether it is a reasonable one.
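
A minimal sketch of the per-sample averaging described above, assuming the AUC matrix has already been loaded into a pandas DataFrame with cells as rows and regulons as columns, and that the sample labels are a Series indexed by the same cell IDs. The helper name `mean_auc_by_group` and the toy values are hypothetical.

```python
import pandas as pd


def mean_auc_by_group(auc: pd.DataFrame, labels: pd.Series) -> pd.DataFrame:
    """Average regulon AUC over the cells of each group.

    auc    : cells x regulons table (assumed layout)
    labels : per-cell group label, indexed by the same cell IDs
    """
    # Align labels to the AUC rows before grouping
    labels = labels.reindex(auc.index)
    return auc.groupby(labels).mean()


# Toy data, made up purely for illustration
auc = pd.DataFrame(
    {"R1(+)": [0.1, 0.3, 0.5, 0.7], "R2(+)": [0.2, 0.2, 0.4, 0.0]},
    index=["c1", "c2", "c3", "c4"],
)
labels = pd.Series(["s1", "s1", "s2", "s2"], index=auc.index)

mean_auc = mean_auc_by_group(auc, labels)
print(mean_auc)
```

Because each group is averaged over its own cells, the result is a samples x regulons table whose values do not scale with the number of cells in a sample.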

ggruenhagen3 commented 6 months ago

It seems like others have used this strategy and the dev team didn't raise any warnings (https://github.com/aertslab/pySCENIC/issues/75#issue-444833301). I will assume this is reasonable, unless the dev team states otherwise.