aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

ranking target genes in a regulon #157

Closed lb15 closed 4 years ago

lb15 commented 4 years ago

Hi!

I am interested in some regulons that have 500-1000 target genes. I'm looking for a way to rank those targets by how good a potential target each one is for the given TF. I use the CLI implementation of pySCENIC and output a CSV file. I noticed that in the .csv output of the ctx command, each target gene is followed by a number. What does this number represent? A weighted score within each motif? This may have been asked in #94, but since there are multiple possible outputs from ctx, I'm not sure whether we are talking about the same file and the same number.

If I sorted genes by this number for each TF (independent of the motif), would this be an appropriate way to find the highest "weighted" target gene within a regulon? Or is there another way to score/rank the target genes?

Related to this, is the NES used for pruning target genes, as a measure of how strongly a particular motif is enriched in a co-expressed gene? I can only find NES values for individual motifs, not for genes.

Thanks very much for your help and for developing a great package!

cflerin commented 4 years ago

Hi @lb15 ,

These are network importance scores (from GRNBoost2/GENIE3) for each of the target genes in the regulon. In theory, yes, this could be used to get a very rough rank of the target gene importances, but I wouldn't really recommend this since it only takes into account information from the GRN and not the subsequent refinement step (ctx, which produces the refined regulons).
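For readers who still want the rough ranking described above, here is a minimal sketch. It assumes each target-gene cell in the ctx CSV is a Python-literal list of `(gene, weight)` tuples, and it keeps the highest weight seen for a gene across all motif rows of one TF. The function name `rank_targets` and the gene names are hypothetical, and taking the maximum is just one possible way to combine rows.

```python
import ast
from collections import defaultdict

def rank_targets(target_gene_cells):
    """Rank genes by their highest importance weight across motif rows.

    target_gene_cells: iterable of strings, each assumed to look like
    "[('GENE_A', 1.5), ('GENE_B', 0.25)]" (an assumption about the CSV cell format).
    """
    best = defaultdict(float)  # importance weights are non-negative, so 0.0 is a safe floor
    for cell in target_gene_cells:
        # literal_eval parses the string safely without executing arbitrary code
        for gene, weight in ast.literal_eval(cell):
            best[gene] = max(best[gene], float(weight))
    # Highest weight first
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

# Two hypothetical motif rows for the same TF
cells = [
    "[('GENE_A', 1.5), ('GENE_B', 0.25)]",
    "[('GENE_B', 2.0), ('GENE_C', 0.5)]",
]
ranked = rank_targets(cells)
```

As noted above, this ordering reflects only the network inference step, so it is best treated as a rough guide rather than a measure of target quality.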

One approach we've used here is to run the whole pySCENIC procedure multiple times (~10-100x), and score each target gene by the number of times it occurs across all runs. This could be very computationally intensive for your entire dataset, but if you're only interested in a few regulons, you can run it for only these TFs. The multi-runs capability is implemented in the SCENIC section of our single cell Nextflow pipeline.
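The occurrence-counting idea can be sketched in a few lines. This is an illustration rather than the Nextflow pipeline's implementation: it assumes you have already collected, for one TF, the set of target genes from each run. The name `target_recurrence` and the gene names are hypothetical.

```python
from collections import Counter

def target_recurrence(runs):
    """Count how often each target gene appears across repeated runs.

    runs: list of iterables of gene names, one per run, all for the same TF.
    Returns (gene, count, fraction_of_runs) tuples, most frequent first.
    """
    counts = Counter()
    for targets in runs:
        counts.update(set(targets))  # set() so a gene counts at most once per run
    n_runs = len(runs)
    # Sort by descending count, then by gene name for a deterministic order
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gene, count, count / n_runs) for gene, count in ordered]

# Three hypothetical runs for one regulon
runs = [
    {"GENE_A", "GENE_B"},
    {"GENE_A", "GENE_C"},
    {"GENE_A", "GENE_B"},
]
recurrence = target_recurrence(runs)
```

Genes that recur in most runs are the more stable targets. Any cutoff on the fraction of runs is a choice you would need to make for your own data.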

Another possibility is to take the pre-refinement modules for the regulon you're interested in, and export them for analysis in iRegulon. This will give you more control over the pruning and possibly a better idea of which genes are more important. See the last section "Further exploration of modules directly from the network inference output" in this notebook for an example of how to get started.

The NES is used for determining motif enrichment against the gene signature, but not on the level of individual genes.
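To illustrate why the NES is a per-motif quantity, here is a minimal sketch. It assumes the NES is the z-score of a motif's recovery AUC against the AUCs of all motifs for the same gene signature, which is how cisTarget-style methods are usually described. The motif names and AUC values are made up, and the use of the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev

def normalized_enrichment_scores(aucs):
    """Convert per-motif AUC values into z-score-style NES values.

    aucs: dict mapping motif name -> AUC for one gene signature.
    """
    values = list(aucs.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("all AUC values are identical; NES is undefined")
    return {motif: (auc - mu) / sigma for motif, auc in aucs.items()}

# Hypothetical AUCs for three motifs scored against one gene signature
nes = normalized_enrichment_scores({"motif_1": 0.02, "motif_2": 0.04, "motif_3": 0.06})
```

Each score is computed from the whole signature's recovery curve, so nothing in this calculation attaches a value to an individual gene.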

lb15 commented 4 years ago

Thanks, this is very helpful, I will try these suggestions out!