aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0
439 stars 181 forks source link

Consistency between pySCENIC and SCENIC results #290

Closed ivanawinkler closed 3 years ago

ivanawinkler commented 3 years ago

Hello,

thank you for the great tool.

I firstly implemented R version of SCENIC for one of our smaller datasets and we got some very interesting results. So I wanted to analyze our big datasets using pySCENIC since it is much faster in comparison to R version. However, when I tested my python script with my dataset I got a big difference in results between R and python version. R SCENIC in the end determined the activity scores for 381 transcription factors while pySCENIC only determined scores for 4 transcription factors. Looking at the log files, the majority of transcription factors was excluded during the pruning step (with warning message "Less than 80% of the genes in some_gene could be mapped to *. feather".)

I have tried using Docker image of pySCENIC just to exclude that it's an issue with dask version. I got the same result.

I am wondering whether there are some differences in the default settings between R and python version which could explain the difference in the result?

For scRNA-seq standards I have a small dataset of 163 cells of oocytes and granulosa cells.

My R code is:

  org <- "mgi" # or hgnc, or dmel
  dbDir <- dbDir # RcisTarget databases location
  myDatasetTitle <- myDatasetTitle # choose a name for your analysis
  data(defaultDbNames)
  dbs <- defaultDbNames[[org]]
  scenicOptions <- initializeScenic(org=org, dbDir=dbDir, dbs=dbs, datasetTitle=myDatasetTitle, nCores=nCores)
# (Adjust minimum values according to your dataset)
  genesKept <- geneFiltering(exprs_matrix, scenicOptions=scenicOptions,
                           minCountsPerGene=3*.01*ncol(exprs_matrix),
                           minSamples=ncol(exprs_matrix)*.01)
  exprMat_filtered <- exprs_matrix[genesKept, ]
  runCorrelation(exprMat_filtered, scenicOptions)
  exprMat_filtered <- log2(exprMat_filtered+1)
  list(exprMat_filtered, scenicOptions)}

#running Genie3 (Building the gene regulatory network (GRN))

  set.seed(123)
  runGenie3(exprMat_filtered, scenicOptions)
  scenicOptions@settings$seed <- 123
  scenicOptions <- runSCENIC_1_coexNetwork2modules(scenicOptions)

#Select potential direct-binding targets (regulons) based on DNA-motif analysis (RcisTarget: TF motif analysis)
  scenicOptions <- runSCENIC_2_createRegulons(scenicOptions)

#Analyzing the network activity in each individual cell (AUCell)
  scenicOptions <- runSCENIC_3_scoreCells(scenicOptions, exprMat_filtered)

  regulonAUC <- loadInt(scenicOptions, "aucell_regulonAUC")
  regulonAUC <- regulonAUC[onlyNonDuplicatedExtended(rownames(regulonAUC)),]

My python code:

nCores = 2
    local_cluster = LocalCluster(
        n_workers=4,
        ncores=nCores,
        dashboard_address=None if DKFZ_CLUSTER else ':7878',
        # worker_dashboard_address=None if DKFZ_CLUSTER else '7879',
        scheduler_port=0,
    )       #possibility 2, fast!
    with Client(local_cluster) as client:                 #(1:20 for nCores = 32)
        adjacencies = grnboost2(expression_data=reads, 
                        tf_names=tf_names, seed = 123, verbose=True, client_or_address=client)

    modules = list(modules_from_adjacencies(adjacencies, reads))
    #write modules in a file as binary and as text
    with open("modules.csv", "wb") as f:
        pickle.dump(modules, f)
    modules_txt = open("modules.csv", "w")
    modules_txt.write(str(modules))
    modules_txt.close()

    #%%
    df = prune2df(dbs, modules, f_motif_path, client_or_address="dask_multiprocessing") #originally "local"
    regulons = df2regulons(df)

Thank you for your help.

Best regards,

Ivana

cflerin commented 3 years ago

Hi @ivanawinkler ,

There are a few things that could be different here. The first thing is maybe your filter settings on the expression matrix -- it's not clear whether you're using the same input for both methods. The other major one is the GRN method: GENIE3 (R) vs GRNBoost2 (python). Here, I see you set the same seed, but these are different algorithms entirely so I wouldn't expect this to give identical results (although generally they will be very similar). Lastly, you have the parameters for the pruning step (prune2df or runSCENIC_2_createRegulons)... did you use the same databases, etc?

That said, it does seem very strange that you only find 4 regulons with pySCENIC. It's normal to see a few of these "Less than 80%" warnings, but if it's really excluding most of the TFs here, it could indicate some problem with your data (proper gene names used?).

ivanawinkler commented 3 years ago

Thank you very much for you answer.

I use the same input for both implementations (same expression matrix and same databases) and that is what is puzzling me. Since I am using the same expression matrix and databases, gene names should not be an issue (otherwise I guess R would have same issues). In my expression matrix, I use mouse external gene name and databases I downloaded following SCENIC instructions: dbFiles <- c("https://resources.aertslab.org/cistarget/databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/mm9-500bp-upstream-7species.mc9nr.feather", "https://resources.aertslab.org/cistarget/databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/mm9-tss-centered-10kb-7species.mc9nr.feather")

ivanawinkler commented 3 years ago

Hello,

I think I have figured out the issue. As I wrote earlier I used the same count matrix for both analysis, but in rSCENIC I use following filtering steps:

genesKept <- geneFiltering(exprs_matrix, scenicOptions=scenicOptions,
                           minCountsPerGene=3*.01*ncol(exprs_matrix),
                           minSamples=ncol(exprs_matrix)*.01)
exprMat_filtered <- exprs_matrix[genesKept, ]
runCorrelation(exprMat_filtered, scenicOptions)
exprMat_filtered <- log2(exprMat_filtered+1)

as recommened in tutorial, while in pySCENIC I just passed unfiltered matrix (same one as the one I start in R).

If I use the aforementioned R code to generate the matrix and the feed it to pySCENIC I get comparable results.

Best,

Ivana