ggruenhagen3 closed this issue 6 months ago
I found that taking the mean of the AUC matrix (i.e., from lf.ca.RegulonsAUC) per sample produced results that made sense biologically and matched our expectations. The strategy seems to work well, but I wanted to check whether it is a reasonable one.
It seems like others have used this strategy and the dev team didn't raise any warnings (https://github.com/aertslab/pySCENIC/issues/75#issue-444833301). I will assume this is reasonable, unless the dev team states otherwise.
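For anyone landing here later, the per-sample mean described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not my actual analysis: the cell names, regulon names, and the sample_of_cell labels are all made up, and I am assuming the AUC matrix has already been loaded as a pandas DataFrame with cells as rows and regulons as columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an AUC matrix (cells x regulons).
# All names below are hypothetical and only serve the illustration.
rng = np.random.default_rng(0)
cells = [f"cell_{i}" for i in range(6)]
auc_mtx = pd.DataFrame(
    rng.random((6, 2)),
    index=cells,
    columns=["RegulonA(+)", "RegulonB(+)"],
)

# One sample label per cell, aligned on the same cell index.
sample_of_cell = pd.Series(
    ["sample_01", "sample_01", "sample_01", "sample_02", "sample_02", "sample_03"],
    index=cells,
    name="sample",
)

# Mean AUC per sample: rows become samples, columns stay regulons.
mean_auc_by_sample = auc_mtx.groupby(sample_of_cell).mean()
print(mean_auc_by_sample)
```

Because each sample contributes a single averaged row, the result does not depend directly on how many cells a sample contains, although means from very small samples will be noisier.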
Bug Description
When using regulon_specificity_scores on my auc_matrix, I find that the scores are strongly correlated with the size of my groups of cells. This can be seen in the image below, where the samples (rows) are sorted from largest to smallest. The largest samples (sample_01-03; 5599-8254 cells) have the largest scores and the smallest samples (sample_14-16; 57-314 cells) have the smallest scores. I computed the correlation of each regulon's scores with sample size: the mean correlation was 0.967 and the minimum was 0.751.
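To illustrate why a score of this kind might track group size, here is a small sketch on synthetic data. It assumes the specificity score is one minus the Jensen-Shannon distance between the per-cell AUC values and a group-membership indicator, both normalized to sum to one. That definition is my reading of the metric, not code copied from pySCENIC, and js_distance and rss_sketch are hypothetical helper names. The log base (2) is also an assumption.

```python
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two non-negative vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize to probability distributions
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return np.sqrt(max(divergence, 0.0))  # guard against tiny negative round-off

def rss_sketch(auc, in_group):
    """Assumed specificity score: 1 - JS distance (hypothetical helper)."""
    return 1.0 - js_distance(auc, in_group)

n_cells = 1000
flat_auc = np.ones(n_cells)  # identical activity in every cell: no real specificity
group_sizes = [10, 100, 500]

scores = []
for size in group_sizes:
    in_group = np.zeros(n_cells)
    in_group[:size] = 1.0  # the first `size` cells belong to the group
    scores.append(rss_sketch(flat_auc, in_group))

for size, score in zip(group_sizes, scores):
    print(f"group of {size:4d} cells: score = {score:.3f}")
```

Under these assumptions the score rises with group size even though the regulon is equally active in every cell, because a larger group's indicator vector is closer to the uniform distribution over all cells. If the library's metric behaves the same way, that alone could produce a strong correlation with sample size.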
Steps to reproduce the behavior
I am using regulon_specificity_scores to compare regulons across my samples (i.e., within a regulon and across samples). The relevant code is below.
Brief Description of Previous Steps
For more context, here is a brief description of the steps taken before this. I used the CLI and input my raw counts matrix (22526 genes x 43456 cells) without batch correction. I already ran pyscenic grn, pyscenic ctx, and pyscenic aucell. I executed this on an HPC running CentOS Linux 7, and I am using pyscenic 0.12.1 installed via pip.
Questions