Question about gene filtering

aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.

http://scenic.aertslab.org

GNU General Public License v3.0


Question about gene filtering #213

Closed jrknoedler closed 4 years ago

jrknoedler commented 4 years ago

I'm working on analyzing a large number of single-cell datasets and would like to add regulon analysis. Initially I was looking at the R implementation of SCENIC, but would rather use pySCENIC; I've gotten the singularity container working and, as advertised, it runs at least 10x faster, which I greatly prefer since it should speed up tweaking analysis parameters etc. I'm doing all my QC and filtering in Seurat prior to exporting to loom, so my question is whether the singularity container filters genes by expression as described in the R vignette prior to inferring regulons, or whether I would need to do that manually before passing the loom file to pySCENIC. Thanks!

jrknoedler commented 4 years ago

Also, a related question - I'm running this on 10X Genomics data and was comparing the results with and without masking dropouts. They give broadly similar results, but masking dropouts seems to give a longer list - and, in my case, only dropout masking revealed a gene I expected to see since it's a marker for the population I was sorting for prior to sequencing. I'm wondering - is there a standard for whether dropout masking is advisable or not? Thanks!

cflerin commented 4 years ago

Hi @jrknoedler ,

pySCENIC doesn't do any filtering; it runs directly on the matrix you give as input (unless you use one of the Nextflow workflows, which it doesn't sound like you're doing). So yes, you would need to do the filtering beforehand.
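For anyone who wants to do this step in Python rather than in Seurat, here is a minimal sketch of expression-based gene filtering on a cells x genes count matrix. The function name `filter_genes` and both thresholds are hypothetical placeholders chosen for the toy data; they are not taken from pySCENIC or from the R vignette, so pick values that suit your dataset.

```python
import numpy as np

def filter_genes(counts, min_counts_per_gene, min_cells):
    """Return a boolean mask over genes (columns) to keep.

    counts: array of shape (n_cells, n_genes) with raw counts.
    A gene is kept if its total count across all cells is at least
    min_counts_per_gene AND it is detected (count > 0) in at least
    min_cells cells. Both thresholds are illustrative assumptions.
    """
    counts = np.asarray(counts)
    total_counts = counts.sum(axis=0)            # total counts per gene
    n_cells_detected = (counts > 0).sum(axis=0)  # cells expressing each gene
    return (total_counts >= min_counts_per_gene) & (n_cells_detected >= min_cells)

# Toy matrix: 4 cells (rows) x 3 genes (columns)
toy = np.array([
    [5, 0, 1],
    [3, 0, 0],
    [4, 1, 0],
    [6, 0, 0],
])

keep = filter_genes(toy, min_counts_per_gene=3, min_cells=2)
filtered = toy[:, keep]  # only the first gene passes both thresholds
```

The filtered matrix (together with the matching gene names) is what you would then write to the loom file that pySCENIC reads.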

I prefer not to mask dropouts. With masking, for each TF-gene pair, it only compares cells in which both the TF and the target gene have non-zero expression. This (unfairly, in my opinion) excludes cells where the TF is expressed but the target gene isn't (and the other way around), even though the zeros in these cells are a mixture of dropouts and true non-expression.
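To make the difference concrete, here is a small sketch of what masking does to a single TF-target comparison. It is not pySCENIC's own code: the helper `tf_target_correlation` is hypothetical, the expression values are made up, and Pearson correlation via NumPy is assumed purely for illustration.

```python
import numpy as np

def tf_target_correlation(tf_expr, target_expr, mask_dropouts=False):
    """Pearson correlation between a TF and a target gene across cells.

    With mask_dropouts=True, only cells where BOTH genes are non-zero
    are used. Illustrative sketch only, not pySCENIC's implementation.
    """
    tf_expr = np.asarray(tf_expr, dtype=float)
    target_expr = np.asarray(target_expr, dtype=float)
    if mask_dropouts:
        both = (tf_expr > 0) & (target_expr > 0)
        tf_expr, target_expr = tf_expr[both], target_expr[both]
    # Correlation is undefined with fewer than 2 cells or zero variance
    if tf_expr.size < 2 or tf_expr.std() == 0 or target_expr.std() == 0:
        return float("nan")
    return float(np.corrcoef(tf_expr, target_expr)[0, 1])

# Six toy cells: cell 2 expresses only the target, cell 3 only the TF
tf = np.array([0, 0, 4, 5, 6, 3])
target = np.array([0, 2, 0, 5, 6, 3])

rho_all = tf_target_correlation(tf, target)                         # uses all 6 cells
rho_masked = tf_target_correlation(tf, target, mask_dropouts=True)  # uses 3 cells
n_used_masked = int(((tf > 0) & (target > 0)).sum())
```

In this toy case the masked correlation comes out higher because the two discordant cells (TF on, target off, and vice versa) are dropped. That is the kind of effect that could let more TF-target pairs pass a correlation cutoff when masking is on.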