aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

pySCENIC for bulk RNAseq #136

Closed alyamahmoud closed 4 years ago

alyamahmoud commented 4 years ago

I want to use pySCENIC for bulk RNAseq tumour samples.

I am thinking that the major issue will be optimizing the threshold for the enrichment score and then the regulon activity score.

I will test using both DESeq2-normalized read counts and FPKMs. Is there a reason to favour one over the other?

Your feedback/comments will be appreciated.

alyamahmoud commented 4 years ago

In Suo et al., they applied SCENIC to infer regulons based on the group-averaged gene expression profiles of 20 cells. Doesn't this approximate bulk RNA-seq, though of course with reduced heterogeneity?

bramvds commented 4 years ago

Dear,

Running SCENIC on bulk RNA-seq is not a problem. The first phase is based on GENIE3, which was initially developed for bulk RNA-seq: Huynh-Thu, V., Irrthum, A., Wehenkel, L., Geurts, P. (2010). Inferring regulatory networks from expression data using tree-based methods. PLoS ONE 5(9). https://dx.doi.org/10.1371/journal.pone.0012776

Regarding the expression units to be used as input to GENIE3/GRNBoost2: because SCENIC's first step, i.e. network inference using GENIE3/GRNBoost2, relies on tree-based methods, there should be no need to transform the gene expression matrix. GENIE3 is based on a "regression per target gene" strategy using a Random Forest (RF) algorithm under the hood to capture non-linear relationships between transcription factor and target. Features do not need to be scaled or transformed for an RF technique to work properly. See also: https://stats.stackexchange.com/questions/58697/when-to-log-exp-your-variables-when-using-random-forest-models . In fact, the GENIE3 tutorial (https://bioconductor.org/packages/release/bioc/vignettes/GENIE3/inst/doc/GENIE3.html) also mentions: "Note that the expression data do not need to be normalised in any way".
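
To illustrate the point about features not needing to be transformed, here is a minimal sketch in plain Python (not GENIE3 or pySCENIC code). It finds the best single regression-tree split on one predictor and checks that the resulting partition of samples is the same for raw values and for log-transformed values. The function name `best_split_partition` and the toy numbers are made up for illustration. Note that this only covers monotone transformations of the predictors; in GENIE3 each gene also serves as a target in turn, and rescaling a target does change the squared-error criterion, so results on a transformed matrix are not guaranteed to be identical.

```python
import math


def best_split_partition(x, y):
    """Return the sample indices sent left by the best single split on x.

    The split minimises the summed squared error of y around each side's
    mean, the usual criterion for regression trees. Hypothetical helper,
    written for illustration only.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_sse, best_left = float("inf"), None
    for k in range(1, len(order)):
        # Only split between distinct feature values.
        if x[order[k - 1]] == x[order[k]]:
            continue
        left, right = order[:k], order[k:]
        sse = 0.0
        for side in (left, right):
            mean = sum(y[i] for i in side) / len(side)
            sse += sum((y[i] - mean) ** 2 for i in side)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left


# Toy data (made up): one TF in count-like units and one target gene.
tf_counts = [0, 3, 10, 50, 200, 1000, 4000]
target = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3, 5.0]

raw_partition = best_split_partition(tf_counts, target)
log_partition = best_split_partition([math.log1p(v) for v in tf_counts], target)

# A strictly increasing transform keeps the sample ordering, so the
# candidate splits and the chosen partition are unchanged.
same_partition = raw_partition == log_partition
print(same_partition)
```

The split depends only on the rank order of the predictor values, which is why a log or other strictly increasing transform of the predictors leaves the tree structure unchanged.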

However, due to the stochastic nature of the GENIE3/GRNBoost2 algorithms, you will get different results when running pySCENIC several times on the same data set. One strategy to deal with this is to run pySCENIC multiple times and tally the recurrent regulons.
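
The tallying step could look roughly like the sketch below. It assumes you have already collected the regulon names from each run into plain Python lists; how you extract them from the pySCENIC output is not shown. The function `recurrent_regulons`, the `min_runs` cutoff, and the regulon names are all hypothetical and not part of the pySCENIC API.

```python
from collections import Counter


def recurrent_regulons(runs, min_runs):
    """Return regulon names found in at least `min_runs` of the runs.

    `runs` is a list of lists of regulon names, one inner list per run.
    Hypothetical helper, written for illustration only.
    """
    # Use set(run) so a name is counted at most once per run.
    counts = Counter(name for run in runs for name in set(run))
    return sorted(name for name, n in counts.items() if n >= min_runs)


# Made-up regulon names from five hypothetical runs on the same data set.
runs = [
    ["MYC(+)", "E2F1(+)", "STAT3(+)"],
    ["MYC(+)", "E2F1(+)"],
    ["MYC(+)", "STAT3(+)", "TP53(+)"],
    ["MYC(+)", "E2F1(+)", "STAT3(+)"],
    ["MYC(+)", "E2F1(+)"],
]

# Keep regulons recovered in at least 4 of the 5 runs.
stable = recurrent_regulons(runs, min_runs=4)
print(stable)
```

The cutoff is a judgment call: a stricter `min_runs` keeps fewer but more reproducible regulons. The same idea can be applied to the target genes within each regulon.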

Hope this helps, Bram

alyamahmoud commented 4 years ago

Awesome! Thank you very much.