aertslab / SCENICprotocol

A scalable SCENIC workflow for single-cell gene regulatory network analysis
GNU General Public License v3.0

Help with Analysis for Multiple samples, Counts normalization? #58

Open AlinaKurjan opened 2 years ago

AlinaKurjan commented 2 years ago

Hi, I have questions about how to run a pySCENIC analysis correctly on data from more than one sample. For example, I am working with data from 3 biological replicates, which I intend to integrate using Harmony later on. My questions are: 1) Do the pySCENIC steps of the protocol need to be carried out separately for each dataset, or can the datasets first be merged into a single one that is then used as input for the GRN step? If the latter is fine, can the counts come from an already integrated dataset? 2) Do the counts need to be normalised? Many tutorials show this step, but your protocol paper does not.

Thank you for your time.

AlinaKurjan commented 2 years ago

Okay, please never mind Q2. I've found a previous reply to it by @bramvds (https://github.com/aertslab/pySCENIC/issues/128), quoted below:

Because SCENIC's first step, i.e. network inference using GENIE3/GRNBoost2, relies on tree-based methods there should be no need to transform the gene expression matrix. GENIE3 is based on a "regression per target gene" strategy using a Random Forest (RF) algorithm under the hood to capture non-linear relationships between factor and target. Features do not need to be scaled or transformed for a RF technique to work properly. See also: https://stats.stackexchange.com/questions/58697/when-to-log-exp-your-variables-when-using-random-forest-models . In fact, the GENIE3 tutorial (https://bioconductor.org/packages/release/bioc/vignettes/GENIE3/inst/doc/GENIE3.html) also mentions: "Note that the expression data do not need to be normalised in any way".
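To make the point about tree-based methods concrete, here is a minimal sketch of a single regression-tree split (a decision stump) in plain Python. It is not GENIE3 or GRNBoost2 code; the function names and the toy counts are made up for illustration. It only shows that a strictly monotone transform of a predictor (here `log1p` of a hypothetical TF's counts) leaves the chosen partition of cells unchanged. It says nothing about transforming the target gene's expression, which does change the squared error being minimised.

```python
import math


def sum_sq_err(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Find the split on one feature that minimises the summed squared error.

    Returns the indices of the samples on the left side of the best split
    (as a frozenset) together with the error of that split.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_sse, best_left = float("inf"), None
    for cut in range(1, len(order)):
        # A threshold cannot separate two identical feature values.
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        sse = sum_sq_err([ys[i] for i in left]) + sum_sq_err([ys[i] for i in right])
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left, best_sse


# Toy data (hypothetical): raw counts of one TF and one target gene in 6 cells.
tf_counts = [0, 1, 3, 10, 50, 200]
target_counts = [0, 0, 1, 8, 9, 12]

raw_split = best_split(tf_counts, target_counts)
log_split = best_split([math.log1p(x) for x in tf_counts], target_counts)

# The same cells end up on each side of the split, with the same error.
print(raw_split == log_split)
```

The split depends only on the rank order of the predictor values, which `log1p` preserves, so the two calls pick the same partition.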

However, due to the probabilistic nature of the GENIE3/GRNBoost algorithms you will get different results when running pySCENIC several times on the same data set. A strategy to deal with this is to run pySCENIC multiple times and tally the recurrent regulons.
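A rough sketch of what tallying recurrent regulons could look like is given below. This is not part of the pySCENIC API. It assumes each run has already been reduced to a plain dict mapping a TF name to its set of target genes; the function name `recurrent_regulons`, the 0.8 cut-off, and the TF and gene names are all hypothetical choices for illustration.

```python
from collections import Counter


def recurrent_regulons(runs, min_fraction=0.8):
    """Keep regulons, and their targets, seen in at least min_fraction of runs.

    runs: list of dicts, one per run, mapping TF name -> set of target genes.
    Returns a dict mapping each recurrent TF to its recurrent target genes.
    """
    n_runs = len(runs)
    # Each TF is a dict key, so it is counted at most once per run.
    tf_counts = Counter(tf for run in runs for tf in run)
    kept = {}
    for tf, n in tf_counts.items():
        if n / n_runs < min_fraction:
            continue
        # Count in how many runs each gene was a target of this TF.
        gene_counts = Counter(g for run in runs for g in run.get(tf, ()))
        kept[tf] = {g for g, c in gene_counts.items() if c / n_runs >= min_fraction}
    return kept


# Toy input (hypothetical): regulons from three runs on the same data set.
runs = [
    {"TF_A": {"g1", "g2", "g3"}, "TF_B": {"g4"}},
    {"TF_A": {"g1", "g2"}, "TF_C": {"g5"}},
    {"TF_A": {"g1", "g2", "g6"}, "TF_B": {"g4"}},
]

stable = recurrent_regulons(runs, min_fraction=0.8)
print(stable)
```

With this toy input only `TF_A` appears in at least 80% of the runs, and only `g1` and `g2` are its targets in every run. The threshold is a free parameter to tune for your own data.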