aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

[results] Relation between different positional weight matrices #332

Closed BenSolomon closed 2 years ago

BenSolomon commented 3 years ago

Apologies if this is a naive question, but if the different PWMs reflect the extent of the promoter sequence considered for possible TF targets, would that make a 500bp_up_and_100bp_down_tss PWM a subset of a 10kb_up_and_down_tss PWM? In other words, would you expect to find any regulons enriched by a 500bp_up_and_100bp_down_tss PWM that wouldn't be enriched with a 10kb_up_and_down_tss PWM, given that the latter should also cover the region from 500 bp upstream to 100 bp downstream of the TSS for each target?

If the answer is no, I'm curious about part of the code in the tutorial for incorporating ChIP-seq tracks rather than motif sequences as PWMs. Specifically this segment:

import glob
# ranking databases
f_db_glob = "/staging/leuven/res_00001/databases/cistarget/databases/homo_sapiens/hg38/refseq_r80/tc_v1/gene_based/encode_20190621__ChIP_seq_transcription_factor.hg38__refseq-r80__*feather"
f_db_names = ' '.join( glob.glob(f_db_glob) )
print( f_db_names )

### OUTPUT: /ddn1/vol1/staging/leuven/res_00001/databases/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc9nr/gene_based/hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather /ddn1/vol1/staging/leuven/res_00001/databases/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc9nr/gene_based/hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather

# track annotations (track-to-TF table, in motif-to-TF format)
f_track_path = "/staging/leuven/stg_00002/lcb/icistarget/data/annotations/homo_sapiens/hg38/track_annotations/encode_project_20190621__ChIP-seq_transcription_factor.homo_sapiens.hg38.bigwig_signal_pvalue.track_to_tf_in_motif_to_tf_format.tsv"

!pyscenic ctx adj.tsv \
    {f_db_names} \
    --annotations_fname {f_track_path} \
    --expression_mtx_fname {f_loom_path_scenic} \
    --output reg_tracks.csv \
    --num_workers 20

This seems to utilize both 500bp_up_and_100bp_down_tss and 10kb_up_and_down_tss PWMs at the same time. My two questions related to this are:

1) What is happening when multiple PWMs are specified? How are the scores for identical motif/track x TF target values between the two matrices combined?

2) As above, will the combination of the 500bp_up_and_100bp_down_tss and 10kb_up_and_down_tss PWMs result in any enriched targets that the 10kb_up_and_down_tss PWM would not also enrich on its own?

Thank you!

lucygarner commented 3 years ago

I have a similar question - I am using the following cisTarget databases: (1) hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather (2) hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather

I would like to understand how the scores from these are combined in pySCENIC.

cflerin commented 2 years ago

I believe this comment in #334 may cover this question. When using multiple cisTarget databases, each module is pruned separately against each database, then later combined.
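
As a rough illustration of "pruned separately against each database, then later combined", here is a toy sketch in plain Python. It is not pySCENIC's code: the gene names, the rankings, the `prune` helper and the simple top-N cutoff are all hypothetical stand-ins (the real pruning step scores enrichment rather than applying a fixed cutoff), and merging by set union is an assumption about how the per-database results are put together. The toy data also assumes that the two databases rank genes differently for the same motif, which is what allows the smaller window to contribute targets of its own.

# Toy illustration only; every name and number below is hypothetical.
# Each "database" holds, per motif/track, a ranking of genes (best first).
databases = {
    "500bp_up_and_100bp_down_tss": {"motif_A": ["G1", "G2", "G3", "G4", "G5", "G6"]},
    "10kb_up_and_down_tss": {"motif_A": ["G4", "G5", "G2", "G6", "G1", "G3"]},
}

# A co-expression module for one TF (the kind of thing derived from adj.tsv).
module = {"tf": "TF1", "genes": {"G1", "G2", "G4"}}


def prune(module, rankings, top_n=3):
    """Keep module genes found in the top_n ranks of any motif in ONE database.

    The fixed top_n cutoff is a simplified stand-in for enrichment-based pruning.
    """
    kept = set()
    for ranked_genes in rankings.values():
        kept |= module["genes"] & set(ranked_genes[:top_n])
    return kept


# Step 1: prune the module against each database independently.
per_db = {name: prune(module, rankings) for name, rankings in databases.items()}

# Step 2: combine the per-database results (assumed here to be a union).
combined = set().union(*per_db.values())

for name, genes in per_db.items():
    print(name, sorted(genes))
print("combined", sorted(combined))

In this toy data, G1 survives pruning only in the 500bp_up_and_100bp_down_tss ranking, so the combined target set contains a gene that the 10kb_up_and_down_tss ranking alone would not have retained. Whether that happens with real data depends on how the two databases actually rank each gene.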