aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

Questions about the results of pySCENIC #128

Open ShaowenJ opened 4 years ago

ShaowenJ commented 4 years ago

Hi pySCENIC developers,

I gave pySCENIC a quick try, and the results were good and interesting; they were also confirmed by our Seurat findings. However, I am not sure how to interpret the results, so I have several questions that I hope you can kindly answer.

  1. What is the recommended input count matrix for the pipeline: the raw count matrix, or one with some normalization such as log2? I tried both and got different results. Do you have any suggestions? The SCENIC paper says it prefers gene-summarized counts, but what does that mean? Could you give an example?

  2. What AUC threshold should be used to select a strong and significant regulon? Here are the distribution plots for my clusters. As you can see, the values seem very low; the highest are around AUC 0.3. As far as I know, that is probably not a very good value.

image

  3. I also tried the RSS ranking, and the values are also very low. Does a higher RSS value mean that the regulon is more active in that cluster of cells?

image

Thanks very much for your patience and time.

ShaowenJ commented 4 years ago

I have one more question that might be interesting. After calculating the RSS, we get a ranked list of regulons for each cluster. Would it be worthwhile to compare these lists to find statistically differential regulons (for example, by assigning a p-value)? I guess this would be possible with a nonparametric method such as the Mann-Whitney U test. Has anyone done this before?

bramvds commented 4 years ago

Hi

  1. Because SCENIC's first step, i.e. network inference using GENIE3/GRNBoost2, relies on tree-based methods there should be no need to transform the gene expression matrix. GENIE3 is based on a "regression per target gene" strategy using a Random Forest (RF) algorithm under the hood to capture non-linear relationships between factor and target. Features do not need to be scaled or transformed for a RF technique to work properly. See also: https://stats.stackexchange.com/questions/58697/when-to-log-exp-your-variables-when-using-random-forest-models . In fact, the GENIE3 tutorial (https://bioconductor.org/packages/release/bioc/vignettes/GENIE3/inst/doc/GENIE3.html) also mentions: "Note that the expression data do not need to be normalised in any way".
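As a rough illustration of why tree-based methods do not need scaled predictors, here is a minimal standard-library sketch of a single regression-tree split. It is not GENIE3 or GRNBoost2 code; the function name and the toy numbers are made up for the example. It only covers the predictor side: a monotone transform such as log2 keeps the ordering of the factor's values, so the best split separates the same cells. Transforming the target gene as well changes the squared-error criterion, so results are not guaranteed to be identical in that case.

```python
import math

def best_split_partition(x, y):
    """Return the sample indices sent left by the best single split on x.

    The best split minimises the summed squared error of y around the
    mean of each side, as a regression tree node would.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_sse, best_left = math.inf, None
    for k in range(1, len(order)):
        # Only split between distinct feature values.
        if x[order[k - 1]] == x[order[k]]:
            continue
        left, right = order[:k], order[k:]
        sse = 0.0
        for side in (left, right):
            side_mean = sum(y[i] for i in side) / len(side)
            sse += sum((y[i] - side_mean) ** 2 for i in side)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left

# Toy data (hypothetical): counts of one factor and expression of one target.
tf_counts = [0, 1, 2, 5, 9, 20, 40, 80]
target = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.8, 5.1]

raw_partition = best_split_partition(tf_counts, target)
log_partition = best_split_partition(
    [math.log2(c + 1) for c in tf_counts], target
)
same_partition = raw_partition == log_partition
```

Here `same_partition` is true: the raw and log-transformed factor values lead to the same grouping of cells at the split.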

However, due to the stochastic nature of the GENIE3/GRNBoost2 algorithms, you will get different results when running pySCENIC several times on the same data set. One strategy to deal with this is to run pySCENIC multiple times and tally the recurrent regulons.
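The tallying itself can be sketched in a few lines. This assumes you have already collected the regulon names of each run into a set; the helper name, the `min_runs` cut-off and the regulon names below are hypothetical, and how you extract the names from the pySCENIC output is up to you.

```python
from collections import Counter

def recurrent_regulons(runs, min_runs):
    """Return the regulon names that appear in at least `min_runs` runs."""
    # set(run) guards against a name being counted twice within one run.
    counts = Counter(name for run in runs for name in set(run))
    return sorted(name for name, n in counts.items() if n >= min_runs)

# Hypothetical regulon names from three runs on the same data set.
runs = [
    {"SOX2(+)", "PAX6(+)", "GATA1(+)"},
    {"SOX2(+)", "PAX6(+)"},
    {"SOX2(+)", "GATA1(+)", "MYC(+)"},
]
stable = recurrent_regulons(runs, min_runs=2)
```

With these toy runs, `stable` keeps the three regulons seen at least twice and drops the one that showed up only once. The cut-off is a judgment call; a stricter value keeps fewer but more reproducible regulons.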

  2. Does this plot display the distribution of AUCell values for the same regulon across the cell type clusters in your experiment? If so, this indicates that there is no clear association between that regulon and any cluster of cells in your experiment.

Some remarks on how to interpret AUCell scores: the scores provided by SCENIC's last step, AUCell, are enrichment scores and need to be interpreted with some restrictions in mind: (1) You can only compare these unnormalized scores to assess the relative importance of one regulon between different cells (or clusters of cells). Unnormalized scores of different regulons should not be compared across cells or clusters of cells. (2) The actual magnitude of the raw scores depends on several factors (including technical ones like auc_threshold). To derive biological insights, you can look at the distribution of the AUCell values of a regulon across all cells (e.g. a bimodal distribution indicates the presence of two types of cells in the experiment - on versus off) or compare the average AUCell scores for that regulon between two clusters of cells (a permutation test can be used to get a p-value for this comparison).
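The permutation test mentioned above could look roughly like this. It is a minimal standard-library sketch, not part of pySCENIC: the function name, the number of permutations and the toy AUCell values are assumptions for illustration. It compares one regulon between two clusters by shuffling the cluster labels and counting how often the shuffled difference in means is at least as large as the observed one.

```python
import random

def permutation_p_value(scores_a, scores_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean scores."""
    rng = random.Random(seed)
    n_a, n_b = len(scores_a), len(scores_b)
    observed = abs(sum(scores_a) / n_a - sum(scores_b) / n_b)
    pooled = list(scores_a) + list(scores_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign cells to the two clusters at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance so floating-point rounding does not hide ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical AUCell values of ONE regulon in two clusters of cells.
cluster_a = [0.21, 0.25, 0.23, 0.27, 0.24, 0.26]
cluster_b = [0.05, 0.07, 0.04, 0.06, 0.08, 0.05]
p_value = permutation_p_value(cluster_a, cluster_b)
```

Note that the comparison stays within a single regulon, in line with restriction (1). If you test many regulons this way, some multiple-testing correction would be needed.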

  3. The RSS metric identifies a relationship between a regulon and a predefined cluster of cells - high RSS scores indicate regulons strongly "associated" with the cells of the cluster. In this case, we typically plot, for one cluster of cells, the RSS scores of all regulons identified by pySCENIC. Regulons in the right tail are likely to play a role in driving the identity of the cells of that cluster [See also the plots in the first publication that used the RSS metric - Suo, S., Zhu, Q., Saadatpour, A., Fei, L., Guo, G., Yuan, G. (2018). Revealing the Critical Regulators of Cell Identity in the Mouse Cell Atlas. Cell Reports 25(6), 1436-1445.e3. https://dx.doi.org/10.1016/j.celrep.2018.10.045].

If I interpret your RSS plot correctly, you should look at the regulons at the right end of the distribution plot of each cluster individually and check whether they are specific to that cluster.
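To make the RSS idea more concrete, here is a minimal sketch of the score as I understand it from Suo et al.: one minus the square root of the Jensen-Shannon divergence between the normalized AUCell vector of a regulon and the normalized cluster indicator vector. This is a standard-library illustration, not pySCENIC's own code; the function names and toy values are made up, and the base-2 logarithm (which bounds the divergence between 0 and 1) is an assumption, so absolute values may differ from what pySCENIC reports.

```python
import math

def _normalise(values):
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def _kl_divergence(p, q):
    # Terms with p_i == 0 contribute nothing to the sum.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regulon_specificity_score(auc_per_cell, in_cluster):
    """RSS = 1 - sqrt(JSD(P, Q)) for one regulon and one cluster.

    P is the AUCell vector normalised over all cells, Q is the
    normalised 0/1 indicator of cluster membership.
    """
    p = _normalise(auc_per_cell)
    q = _normalise([1.0 if flag else 0.0 for flag in in_cluster])
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * _kl_divergence(p, m) + 0.5 * _kl_divergence(q, m)
    return 1.0 - math.sqrt(max(jsd, 0.0))

# Hypothetical AUCell values of one regulon in six cells.
auc = [0.30, 0.28, 0.31, 0.02, 0.01, 0.03]
is_cluster_a = [True, True, True, False, False, False]
is_cluster_b = [not flag for flag in is_cluster_a]

rss_a = regulon_specificity_score(auc, is_cluster_a)
rss_b = regulon_specificity_score(auc, is_cluster_b)
```

In this toy case the regulon is active almost only in cluster A, so `rss_a` comes out clearly higher than `rss_b`. The score depends on the relative pattern of activity across cells, not on the absolute AUCell magnitude.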

Hope this helps, Bram