CostaLab / reg-gen

Regulatory Genomics Toolbox: Python library and set of tools for the integrative analysis of high throughput regulatory genomics data.
https://reg-gen.readthedocs.io/
Other
103 stars 30 forks source link

Two questions about rgt-hint footprinting input : shift reads and more expected peaks #205

Closed Rui-Jing closed 2 years ago

Rui-Jing commented 2 years ago

Hi, Thanks for developing this amazing package. I have a question about whether I should shift the reads of the BAM file as input to rgt-hint footprinting. I don't know if hint-ATAC does any corrective work within the software. And what kind of bam files does NucleoATAC expect?

Reference: In the first ATAC-seq paper (Buenrostro et al., 2013), all reads aligning to the + strand were offset by +4 bp, and all reads aligning to the – strand were offset −5 bp, since Tn5 transposase has been shown to bind as a dimer and insert two adaptors separated by 9 bp (Adey et al., 2010).

I face the same issue with peak calling by MACS3 after shifting the reads: https://github.com/macs3-project/MACS/issues/145 as -f BAMPE and -f BAM have different behavior.

For a standard peaks input of HINT-ATAC,which of the following 1 or 2 is the more expected input: 1. Peak calling on ATAC-seq (paired-end mode): $ macs3 callpeak -f BAMPE -t ATAC.bam -g hs -n test -B -q 0.01 2. Peak calling on ATAC-seq ( focusing on insertion sites, and using single-end mode): $ macs3 callpeak -f BAM -t ATAC.bam -g hs -n test -B -q 0.01 --shift -50 --extension 100

lzj1769 commented 2 years ago

Hi @Rui-Jing

There is an internal step for correcting the reads in HINT-ATAC, so you don't need to shift the reads in the BAM file.

For input peaks, I usually use MACS2 and haven't tried MACS3 yet. I would say both 1 and 2 are fine.

Rui-Jing commented 2 years ago

Thanks for your quikly reply. I see the points.

I encounted into another error running rgt-hint tracks :

My code is : rgt-hint tracks --bc --bigWig --organism=mm10 alignment/neg1_733502_sorted.bam peaks/neg1.narrowPeak --output-prefix=neg1_733502_sorted --output-location=footprint.

The error information is :

Error : Traceback (most recent call last):
  File "/home/zfzf/.local/bin/rgt-hint", line 8, in <module>
    sys.exit(main())
  File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/HINT/Main.py", line 95, in main
    args.func(args)
  File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/HINT/Tracks.py", line 59, in tracks_run
    get_bc_tracks(args)
  File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/HINT/Tracks.py", line 185, in get_bc_tracks
    signal = reads_file.get_bc_signal_by_fragment_length(ref=region.chrom, start=region.initial,
  File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/HINT/signalProcessing.py", line 928, in get_bc_signal_by_fragment_length
    currRevComp = AuxiliaryFunctions.revcomp(str(fasta.fetch(ref, p1_wk + 1, p2_wk)).upper())
  File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/Util.py", line 1173, in revcomp
    return "".join([revDict[e] for e in s[::-1]])
  File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/Util.py", line 1173, in <listcomp>
    return "".join([revDict[e] for e in s[::-1]])
KeyError: '8'

This error occurs every time the result file reaches a certain size.

The bam file from bowtie2 align and samtools sort ,there is no filtering. The narrow peaks came from Genrich : Genrich -t alignment_Genrich/pos1_701503.sort_n.bam -o peaks_Genrich/pos1.narrowPeak -j -y -r -e chrM -v

Since Genrich does not need a filtered BAM file as input, I do not know if unfiltered is the cause of the error.

Python version : 3.9.0 RGT version : v0.13.2

lzj1769 commented 2 years ago

Hi,

For your reference, I checked my script and found mostly I used the 1 for peak calling, i.e., macs3 callpeak -f BAMPE -t ATAC.bam -g hs -n test -B -q 0.01

Can you check if there is "chr" in your BAM file by: samtools view BAM_FILE | head

As far as I know, some aligners will only use a number to represent a chromosome, e.g., 1 instead of chr1.

Also, can you check the peaks file by: head peaks_Genrich/pos1.narrowPeak

Best, Li

Rui-Jing commented 2 years ago

Re-again thanks your reply. I forgot to emphasize that I was successful in running rgt-hint footprinting. And for peaks, sed -i '/random\|chrUn\|chrM/d' pos1.narrowPeak Bam file: 398f34264e4744712295b25bad5c49f

Peaks file: 96afe8ddf8b70fee94cc39e48d49cbd

lzj1769 commented 2 years ago

ok, I see. I will check to source code to understand the problem. __ Zhijian Li Institute for Computational Genomics RWTH Aachen University Pauwelsstrasse 19 52074 Aachen, Germany

On Fri, 10 Jun 2022 at 10:59, Rui-Jing @.***> wrote:

Re-again thanks your reply. I forgot to emphasize that I was successful in running rgt-hint footprinting. And for peaks, sed -i '/random|chrUn|chrM/d' pos1.narrowPeak Bam file:

[image: 398f34264e4744712295b25bad5c49f] http://url

Peaks file:

[image: 96afe8ddf8b70fee94cc39e48d49cbd] http://url

— Reply to this email directly, view it on GitHub https://github.com/CostaLab/reg-gen/issues/205#issuecomment-1152139069, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACL4WEX33HYKGGOSTUN3I2DVOL7V5ANCNFSM5YMR7DSQ . You are receiving this because you commented.Message ID: @.***>

Rui-Jing commented 2 years ago

Great, looking forward to your reply.

lzj1769 commented 2 years ago

Hi @Rui-Jing

I checked the code, and the reason is that there is something wrong with reading DNA sequences from the mm10 reference genome.

currRevComp = AuxiliaryFunctions.revcomp(str(fasta.fetch(ref, p1_wk + 1, p2_wk)).upper())

Here, rgt-hint wants to read the DNA sequence in chromosome ref from _p1wk+1 to _p2wk and return its reverse component.

Here is how revcomp defined:

    def revcomp(s):
        """Revert complement string.

        *Keyword arguments:*

            - s -- String.
        """
        revDict = dict([("A", "T"), ("T", "A"), ("C", "G"), ("G", "C"), ("N", "N")])
        return "".join([revDict[e] for e in s[::-1]])

As you can see, only A, C, G, T, and N are expected. However, for some reason, you got a 8 character from the reference genome, which I don't know why.

This error might be caught by rgt-hint footprinting but obviously not by rgt-hint tracks.

Can you set up the mm10 reference genome and try it again? simply by:

cd ~/rgtdata
python setupGenomicData.py --mm10

Let's see if this works.

Best, Zhijian

Rui-Jing commented 2 years ago

Hi, I tried to reset the reference genome, but it didn't work out so well.

Rui-Jing commented 2 years ago

Relevant screenshots are as follows: 0a62a24f42949fbdc2b653f64e52fab 71e7a48730ddb47a3ccf0b6ecc60234

lzj1769 commented 2 years ago

Hi,

Would it be possible to share your data with me? so I can figure out what exactly is happening.

Rui-Jing commented 2 years ago

I'm sorry to late reply. I'm trying to get the data to you. I wonder if you have a good way to deliver large files?

lzj1769 commented 2 years ago

@Rui-Jing

You can upload your files here: https://www.dropbox.com/request/rxoqdSZg3eo0KIHwS8YN

This is a file request created by Dropbox, so the data will keep confidential.

Rui-Jing commented 2 years ago

Since dropbox limits the size of a single file to 2GB, I split and compress bam files. The compressed file is being uploaded.

Thank you again for your patient help.

lzj1769 commented 2 years ago

Hi @Rui-Jing

Thanks for sharing. I have downloaded the data, and ran the command: rgt-hint tracks --bc --bigWig --organism=mm10 pos1_701503.sorted.bam ./pos1.narrowPeak --output-prefix=pos1.narrowPeak --output-location=./

However, it worked well for me and I didn't get the error that you posted.

Do you want me to share the obtained BW file with you? if so, can u let me know your email so I can send you the link.

best, Zhijian

Rui-Jing commented 2 years ago

Great, maybe the main problem is the software version. My email : 546401632@qq.com.

Here I also want to ask if I want to know which transcription factors a target gene may be regulated by, eg; IGF1. Are there any features of RGT that can help me?

lzj1769 commented 2 years ago

Great, maybe the main problem is the software version. My email : 546401632@qq.com.

I send the link to your email so you can download the BW file.

Here I also want to ask if I want to know which transcription factors a target gene may be regulated by, eg; IGF1. Are there any features of RGT that can help me?

Sure, you can associate the TFs to genes by using the function _geneassociation provided by RGT. Here is an example python script that I used in our project: https://www.dropbox.com/s/w1hnqcr27fgqys9/tf_to_gene.py?dl=0

To use it, simply run: python tf_to_gene.py your_motif_matching your_bam_file output_dir

This will give you a text file as output which includes three columns with the first two columns representing TF, and gene, and the third one representing the number of ATAC-seq reads around the TF binding sites.

Note that since this is a very simple approach that might produce some false positives, you can use the number of reads to further filter the prediction.

Best, Zhijian

Rui-Jing commented 2 years ago

Hi, I downloaded the tf_to_gene.py file and it ran successfully. But the results looks weird. The result looks like the output file was not written accurately.

My code is : python3 tf_to_gene.py atac_ljh/match/pos1_701503_sorted_mpbs.bed atac_ljh/alignment/pos1_701503.sorted.bam atac_ljh/match/pos1.txt

My motifmapping file is: 1655115338722

The results is : 7dfb37a5eb60241d3ac271ee7736c5d 7cac9f03b7fd7b6505682c8284f2e87

lzj1769 commented 2 years ago

Hi,

The results are correct. The only problem is that some TFs are named XXX-var.2 or XXX-var.3, which need to be removed first.

You can do it by: sed '/var.2/d;/var.3/d' old_mpbs.bed > new_mpbs.bed

Then use new_mpbs.bed as input for tf_to_gene.py

Rui-Jing commented 2 years ago

Hi, the output were very satisfactory! Once again, my sincere thanks.

We can filter some false positives by reads number. If I am interested in a filtered transcription factor, can I know its most likely binding site according to the bed file of rgt-hint footprinting.

lzj1769 commented 2 years ago

Hi,

We can filter some false positives by reads number. If I am interested in a filtered transcription factor, can I know its most likely binding site according to the bed file of rgt-hint footprinting.

Sure, one quick way is to select the binding sites of this TF from the motif matching file. For example: grep tf_name motif_matching_file > tf_name.bed

Then you can visualize the bed file with IGV and navigate to the gene that you are interested in, and find if there are any predicted binding sites around.

Rui-Jing commented 2 years ago

thanks for the response. now i realize that I can look through the BED file to find possible binding locations.

lzj1769 commented 2 years ago

Great. I will close this thread and feel free to re-open it if there are new issues.