aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

Extended vs. non-extended labels in pySCENIC #12

Closed by wyattmcdonnell 6 years ago

wyattmcdonnell commented 6 years ago

Hi there!

So excited to see all of this incredible work ported into Python! The performance in our hands has moved things from about a day-long computation to somewhere around half an hour—so thanks again!

One thing that I've noticed is that there are (+) and (-) labels for the regulons, but it's unclear which of these (if either?) is the extended version of the regulon. In the R version of SCENIC there are clear _extended labels, but I'm not sure how to distinguish between these in the current iteration. Could you point me in the right direction?

Best wishes, Wyatt

bramvds commented 6 years ago

Dear Wyatt,

Thank you for the feedback!

Regarding your question about the (+) and (-) suffixes for the regulons: this is experimental work in progress. The (+) indicates a positive correlation between the expression levels of a TF and its target genes across cells (from which we infer that the TF acts as a transcriptional activator), while the (-) indicates a negative correlation, i.e. a repressing relationship between the TF and its targets. The (+) labelled regulons are the ones you would get from running the original R version.
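As a small illustration of the suffix convention described above, here is a sketch that splits a list of regulon names by their trailing (+) or (-) label. The function name `split_by_sign` and the example regulon names are made up for illustration; nothing here is part of the pySCENIC API.

```python
def split_by_sign(regulon_names):
    """Partition regulon names by their trailing (+) / (-) suffix.

    Hypothetical helper: it only inspects the name strings.
    """
    activating, repressing, unlabelled = [], [], []
    for name in regulon_names:
        if name.endswith("(+)"):
            activating.append(name)      # TF positively correlated with targets
        elif name.endswith("(-)"):
            repressing.append(name)      # TF negatively correlated with targets
        else:
            unlabelled.append(name)      # no sign suffix found
    return activating, repressing, unlabelled


# Invented example names, not taken from a real run.
names = ["SOX2(+)", "PAX6(-)", "ATF3(+)", "REST(-)"]
activating, repressing, unlabelled = split_by_sign(names)
```

Keeping only the `activating` list would correspond to the (+) regulons, which are the ones comparable to the output of the original R version.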

If you want to filter on direct or indirect TF annotations, first filter the dataframe of enriched motifs [i.e. df = prune2df()] and only then derive regulons from the filtered df using df2regulons. Focus on the column "Annotation", which should contain "gene is directly annotated" for direct annotations. More fine-grained control is possible with two further columns: "MotifSimilarityQValue" (should be 0 if no motif similarity is needed to find a matching TF annotation for the enriched motif in the species under investigation) and "OrthologousIdentity" (a value between 0.0 and 1.0 that gives the orthologous identity of the DBD of the TF proteins involved when SCENIC needs to cross species boundaries to find an appropriate annotation for the enriched motifs).
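A minimal sketch of that filtering step, using a toy pandas DataFrame as a stand-in for the enriched-motif table. The rows, TF names, threshold values and the helper names `keep_direct` and `keep_by_thresholds` are all invented for illustration; the column names are copied from the description above, and the layout of the real table (for example, columns nested under a top-level label) may differ, so adjust the column selection accordingly.

```python
import pandas as pd

DIRECT = "gene is directly annotated"


def keep_direct(df):
    """Keep only rows whose Annotation marks a direct TF annotation."""
    mask = df["Annotation"].str.contains(DIRECT, regex=False)
    return df[mask]


def keep_by_thresholds(df, max_q, min_identity):
    """Finer control: bound motif similarity q-value and orthologous identity."""
    mask = (df["MotifSimilarityQValue"] <= max_q) & (
        df["OrthologousIdentity"] >= min_identity
    )
    return df[mask]


# Toy table; every value below is made up.
motifs = pd.DataFrame(
    {
        "TF": ["TF_A", "TF_B", "TF_C"],
        "Annotation": [
            DIRECT,
            "placeholder: annotated via motif similarity",
            "placeholder: annotated via orthology",
        ],
        "MotifSimilarityQValue": [0.0, 0.0004, 0.0],
        "OrthologousIdentity": [1.0, 1.0, 0.62],
    }
)

direct = keep_direct(motifs)
relaxed = keep_by_thresholds(motifs, max_q=0.001, min_identity=0.9)
```

The filtered table would then be passed to df2regulons in place of the full one; that call is left out here because it needs real pySCENIC output.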

Hope this helps, Bram

dschrein commented 6 years ago

i just wanted to second the compliment here - the performance improvement over the R version is staggering: we went from one week to under a day. :)

the support here has also been prompt and excellent. thank you!

bramvds commented 6 years ago

You're more than welcome. Thanks for the feedback, much appreciated.