aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0
444 stars 183 forks source link

Is it reasonable to concatenate 2 different libraries and run SCENIC together? #367

Closed hyjforesight closed 2 years ago

hyjforesight commented 2 years ago

Hello SCENIC, Because we're not familiar with the GRNBoost2 algorithm, we're wondering whether it is reasonable to concatenate 2 different libraries and run SCENIC for this concatenated dataset. For example, I have WT and KO scRNA-seq data. I want to compare the differences of the regulatory networks between WT and KO. What is the best practice for this comparison? Run SCENIC for WT and KO individually or concatenate them first, then run SCENIC for this concatenated dataset? Thanks! Best, YJ

SeppeDeWinter commented 2 years ago

Hi YJ

Again sorry for the late reply. It is reasonable to concatenate the two different runs before running SCENIC, that's in fact what we also do in our automated pipelines (see: https://vsn-pipelines.readthedocs.io/en/latest/pipelines.html#harmony-scenic-harmony-scenic).

Hope this helps.

Good luck.

Seppe

hyjforesight commented 2 years ago

hello @SeppeDeWinter Thanks for the response!

hyjforesight commented 2 years ago

Hello @SeppeDeWinter Sorry for asking this question again. I checked the VSN pipeline. For the concatenated data, the input data for SCENIC is log-transformed, HVG selected, PCA generated, and batch-effect corrected (Harmony). However, the Nature Protocol paper recommends inputting raw data. Which data is the right one I should input for SCENIC? Thanks! Best, YJ

SeppeDeWinter commented 2 years ago

@hyjforesight

Indeed, for concatenating two runs it might be necessary to do some of these preprocessing steps you mentioned in order to remove effect arising from technical (but not biologica) difference between the two runs.

Hope this helps.

Best,

Seppe

cflerin commented 2 years ago

Hi @hyjforesight ,

Just to clarify here, the VSN Pipeline does not apply the preprocessing/transformation steps to the SCENIC input for Harmony, or any of the other multi-sample pipelines. The only steps applied are basic cell and gene filtering, then concatenation of the raw counts matrices.

However, if you do apply these transformation steps, you will likely get largely similar results. Most importantly, make sure that you do not filter for highly variable genes prior to running SCENIC; you should give the entire expression matrix without feature selection applied.

hyjforesight commented 2 years ago

Hello @SeppeDeWinter and @cflerin Thanks for the response!

Just to clarify here, the VSN Pipeline does not apply the preprocessing/transformation steps to the SCENIC input for Harmony, or any of the other multi-sample pipelines. The only steps applied are basic cell and gene filtering, then concatenation of the raw counts matrices.

If I didn't misunderstand, both VSN pipeline and the manual pipeline (http://htmlpreview.github.io/?https://github.com/aertslab/SCENICprotocol/blob/master/notebooks/SCENIC%20Protocol%20-%20Case%20study%20-%20Cancer%20data%20sets.html) only did the basic cell and gene filtering, which is the best practice for SCENIC. This is exactly what we're following now. But the VSN website shows the workflow of preprocessing/transformation, BBKNN, and the subsequent SCENIC, which may mislead the audience. image image


However, if you do apply these transformation steps, you will likely get largely similar results. Most importantly, make sure that you do not filter for highly variable genes prior to running SCENIC; you should give the entire expression matrix without feature selection applied. Indeed, for concatenating two runs it might be necessary to do some of these preprocessing steps you mentioned in order to remove effect arising from technical (but not biologica) difference between the two runs.

I understand that the results will be largely similar even if I use the transformed or integrated data as input. However, the BBKNN and Harmony integrations are usually based on the HVG selected data, which slices the entire expression matrix. Shall I skip the HVG selection step and use all genes to do the BBKNN or Harmony integration, then run SCENIC? (check the codes below)

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata[:, adata.var.highly_variable]    # do not do this step to keep the entire expression matrix?
sc.pp.regress_out(adata, keys=['total_counts', 'pct_counts_mt','pct_counts_rpl','pct_counts_rps'], n_jobs=16)
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver='arpack')
sce.pp.harmony_integrate(adata, key='batch', basis='X_pca', adjusted_basis='X_pca_harmony')
adata.obsm['X_pca']=adata.obsm['X_pca_harmony']
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, knn=True)
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)

row_attrs = {"Gene": np.array(adata.var_names)}
col_attrs = {"CellID": np.array(adata.obs_names), "nGene": np.array(np.sum(adata.X.transpose()>0, axis=0)).flatten(),
             "nUMI": np.array(np.sum(adata.X.transpose(), axis=0)).flatten()}
lp.create(f_loom_path_scenic, adata.X.transpose(), row_attrs, col_attrs)

!pyscenic grn {f_loom_path_scenic} {Mouse_TFS_FNAME} -o {ADJACENCIES_FNAME} --num_workers 16

Thanks! Best, YJ

cflerin commented 2 years ago

hi @hyjforesight . Yes, I can see how that description for bbknn_scenic could be confusing.

Shall I skip the HVG selection step and use all genes to do the BBKNN or Harmony integration, then run SCENIC? (check the codes below)

Take a look at Figure 1 from the Nature Protocols paper. The idea is that these analyses should be done in parallel, as they provide complementary information. So you should apply the ideal steps for each method, then aggregate the results.