HYsxe / PRINT

32 stars 3 forks source link

How do I predict which site is occupied by which transcription factor? #14

Closed OMIC-coding closed 7 months ago

HYsxe commented 9 months ago

Hi!

Thanks for the question! Currently we will first identify a potential TF binding site using multi-scale footprinting to detect a TF footprint, or use the TF binding model to find positions with high TF binding scores. Then for these candidate TF binding sites, we overlap with underlying TF motif matches to assign it to specific TFs.

Note that although most mainstream footprinting methods rely on motif matching, it is far from a perfect approach. Even in the most popular motif databases, many motifs are mis-labeled or just have low quality. Additionally, TFs in the same family often share similar motifs, making it difficult to distinguish them.

Hope this helps!

Yan

OMIC-coding commented 8 months ago

Dear Yan, Thank you for your reply.

  1. I want to know if "overlap with underlying TF motif matches to assign it to specific TFs" is included in your tutorial file, as specific TF names are absent in the TFbindingSE result picture.
  2. Is pseudobulk object indispensable to follow the pipeline, if I only need to process scATAC-seq data?
  3. How is the .bed file generated, which is mentioned in your tutorial file? (How do I pick regions of interest and then transform it into .bed file)?
HYsxe commented 8 months ago

Hi!

  1. So the TFBindingSE object only stores the TF-agnostic binding scores for each genomic location (i.e. it provides a general scoring for TF binding for any TF). The motif positions can be matched using the package motifmatchr. We also provide a function to do this called getMotifPositions. You can find more details here https://github.com/HYsxe/PRINT/blob/main/analyses/TFBSPrediction/ChIPValidationBMMC.R line 81-116
# Load PWM data
cisBPMotifs <- readRDS("../../data/shared/cisBP_human_pwms_2021.rds")

# Find motif matches
motifPositions <- getMotifPositions(project, cisBPMotifs, combineTFs = F)

Once you have the motif ranges, you can find the positions with high TF binding scores and use the findOverlaps function from the GenomicRanges package to get sites with both a motif match and a high binding score. These are the "predicted binding sites"

  1. We do pseudo-bulking because a certain number of reads is needed for footprinting. Say we have an scATAC dataset where each cell has 20k reads, given that there are at least 200k cCREs in the genome, each promoter or enhancer gets on average 0.1 read. It is impossible to footprint with such low read depth since we need multiple reads to delineate the contour of the footprinted object. Therefore, we always pseudo-bulk to >10M reads per pseudo-bulk for footprinting. Having single cell data allows you to form pseudo-bulks using specific cell types or over a continuum of cell states, but will likely not get around the issue of depth.

  2. The .bed file is generated by calling ATAC-seq peaks using MACS2. We've been using a pretty old version of the command but I think you can start by trying the ENCODE standard like below

macs2 callpeak --nomodel -t frag_file \
    --outdir outdir -n name -f format \
    --nolambda --keep-dup all --call-summits \
    --nomodel -B --SPMR --shift 75 --extsize 150 

We also further remove reigions overlapping with blacklist regions (Can be found here https://github.com/buenrostrolab/multiScaleFootprinting/tree/0819ac66321c46573b08337f57fe2fcb0e030e4c/data/shared/BlacklistFiles). Additionally, you can also remove overlapping peaks to get an disjoint peak set.

Hope this helps!

Yan