Closed Rui-Jing closed 2 years ago
Hi @Rui-Jing
There is an internal step for correcting the reads in HINT-ATAC, so you don't need to shift the reads in the BAM file.
For input peaks, I usually use MACS2 and haven't tried MACS3 yet. I would say both 1 and 2 are fine.
Thanks for your quikly reply. I see the points.
I encounted into another error running rgt-hint tracks :
My code is : rgt-hint tracks --bc --bigWig --organism=mm10 alignment/neg1_733502_sorted.bam peaks/neg1.narrowPeak --output-prefix=neg1_733502_sorted --output-location=footprint.
The error information is :
Error : Traceback (most recent call last):
File "/home/zfzf/.local/bin/rgt-hint", line 8, in <module>
sys.exit(main())
File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/HINT/Main.py", line 95, in main
args.func(args)
File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/HINT/Tracks.py", line 59, in tracks_run
get_bc_tracks(args)
File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/HINT/Tracks.py", line 185, in get_bc_tracks
signal = reads_file.get_bc_signal_by_fragment_length(ref=region.chrom, start=region.initial,
File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/HINT/signalProcessing.py", line 928, in get_bc_signal_by_fragment_length
currRevComp = AuxiliaryFunctions.revcomp(str(fasta.fetch(ref, p1_wk + 1, p2_wk)).upper())
File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/Util.py", line 1173, in revcomp
return "".join([revDict[e] for e in s[::-1]])
File "/home/zfzf/.local/lib/python3.9/site-packages/rgt/Util.py", line 1173, in <listcomp>
return "".join([revDict[e] for e in s[::-1]])
KeyError: '8'
This error occurs every time the result file reaches a certain size.
The bam file from bowtie2 align and samtools sort ,there is no filtering. The narrow peaks came from Genrich : Genrich -t alignment_Genrich/pos1_701503.sort_n.bam -o peaks_Genrich/pos1.narrowPeak -j -y -r -e chrM -v
Since Genrich does not need a filtered BAM file as input, I do not know if unfiltered is the cause of the error.
Python version : 3.9.0 RGT version : v0.13.2
Hi,
For your reference, I checked my script and found mostly I used the 1 for peak calling, i.e.,
macs3 callpeak -f BAMPE -t ATAC.bam -g hs -n test -B -q 0.01
Can you check if there is "chr" in your BAM file by:
samtools view BAM_FILE | head
As far as I know, some aligners will only use a number to represent a chromosome, e.g., 1 instead of chr1.
Also, can you check the peaks file by:
head peaks_Genrich/pos1.narrowPeak
Best, Li
ok, I see. I will check to source code to understand the problem. __ Zhijian Li Institute for Computational Genomics RWTH Aachen University Pauwelsstrasse 19 52074 Aachen, Germany
On Fri, 10 Jun 2022 at 10:59, Rui-Jing @.***> wrote:
Re-again thanks your reply. I forgot to emphasize that I was successful in running rgt-hint footprinting. And for peaks, sed -i '/random|chrUn|chrM/d' pos1.narrowPeak Bam file:
[image: 398f34264e4744712295b25bad5c49f] http://url
Peaks file:
[image: 96afe8ddf8b70fee94cc39e48d49cbd] http://url
— Reply to this email directly, view it on GitHub https://github.com/CostaLab/reg-gen/issues/205#issuecomment-1152139069, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACL4WEX33HYKGGOSTUN3I2DVOL7V5ANCNFSM5YMR7DSQ . You are receiving this because you commented.Message ID: @.***>
Great, looking forward to your reply.
Hi @Rui-Jing
I checked the code, and the reason is that there is something wrong with reading DNA sequences from the mm10 reference genome.
currRevComp = AuxiliaryFunctions.revcomp(str(fasta.fetch(ref, p1_wk + 1, p2_wk)).upper())
Here, rgt-hint wants to read the DNA sequence in chromosome ref from _p1wk+1 to _p2wk and return its reverse component.
Here is how revcomp defined:
def revcomp(s):
"""Revert complement string.
*Keyword arguments:*
- s -- String.
"""
revDict = dict([("A", "T"), ("T", "A"), ("C", "G"), ("G", "C"), ("N", "N")])
return "".join([revDict[e] for e in s[::-1]])
As you can see, only A, C, G, T, and N are expected. However, for some reason, you got a 8 character from the reference genome, which I don't know why.
This error might be caught by rgt-hint footprinting
but obviously not by rgt-hint tracks
.
Can you set up the mm10 reference genome and try it again? simply by:
cd ~/rgtdata
python setupGenomicData.py --mm10
Let's see if this works.
Best, Zhijian
Hi, I tried to reset the reference genome, but it didn't work out so well.
Hi,
Would it be possible to share your data with me? so I can figure out what exactly is happening.
I'm sorry to late reply. I'm trying to get the data to you. I wonder if you have a good way to deliver large files?
@Rui-Jing
You can upload your files here: https://www.dropbox.com/request/rxoqdSZg3eo0KIHwS8YN
This is a file request created by Dropbox, so the data will keep confidential.
Since dropbox limits the size of a single file to 2GB, I split and compress bam files. The compressed file is being uploaded.
Thank you again for your patient help.
Hi @Rui-Jing
Thanks for sharing. I have downloaded the data, and ran the command:
rgt-hint tracks --bc --bigWig --organism=mm10 pos1_701503.sorted.bam ./pos1.narrowPeak --output-prefix=pos1.narrowPeak --output-location=./
However, it worked well for me and I didn't get the error that you posted.
Do you want me to share the obtained BW file with you? if so, can u let me know your email so I can send you the link.
best, Zhijian
Great, maybe the main problem is the software version. My email : 546401632@qq.com
.
Here I also want to ask if I want to know which transcription factors a target gene may be regulated by, eg; IGF1. Are there any features of RGT that can help me?
Great, maybe the main problem is the software version. My email : 546401632@qq.com.
I send the link to your email so you can download the BW file.
Here I also want to ask if I want to know which transcription factors a target gene may be regulated by, eg; IGF1. Are there any features of RGT that can help me?
Sure, you can associate the TFs to genes by using the function _geneassociation provided by RGT. Here is an example python script that I used in our project: https://www.dropbox.com/s/w1hnqcr27fgqys9/tf_to_gene.py?dl=0
To use it, simply run: python tf_to_gene.py your_motif_matching your_bam_file output_dir
This will give you a text file as output which includes three columns with the first two columns representing TF, and gene, and the third one representing the number of ATAC-seq reads around the TF binding sites.
Note that since this is a very simple approach that might produce some false positives, you can use the number of reads to further filter the prediction.
Best, Zhijian
Hi, I downloaded the tf_to_gene.py file and it ran successfully. But the results looks weird. The result looks like the output file was not written accurately.
My code is : python3 tf_to_gene.py atac_ljh/match/pos1_701503_sorted_mpbs.bed atac_ljh/alignment/pos1_701503.sorted.bam atac_ljh/match/pos1.txt
Hi,
The results are correct. The only problem is that some TFs are named XXX-var.2 or XXX-var.3, which need to be removed first.
You can do it by:
sed '/var.2/d;/var.3/d' old_mpbs.bed > new_mpbs.bed
Then use new_mpbs.bed as input for tf_to_gene.py
Hi, the output were very satisfactory! Once again, my sincere thanks.
We can filter some false positives by reads number. If I am interested in a filtered transcription factor, can I know its most likely binding site according to the bed file of rgt-hint footprinting.
Hi,
We can filter some false positives by reads number. If I am interested in a filtered transcription factor, can I know its most likely binding site according to the bed file of rgt-hint footprinting.
Sure, one quick way is to select the binding sites of this TF from the motif matching file. For example:
grep tf_name motif_matching_file > tf_name.bed
Then you can visualize the bed file with IGV and navigate to the gene that you are interested in, and find if there are any predicted binding sites around.
thanks for the response. now i realize that I can look through the BED file to find possible binding locations.
Great. I will close this thread and feel free to re-open it if there are new issues.
Hi, Thanks for developing this amazing package. I have a question about whether I should shift the reads of the BAM file as input to rgt-hint footprinting. I don't know if hint-ATAC does any corrective work within the software. And what kind of bam files does NucleoATAC expect?
Reference: In the first ATAC-seq paper (Buenrostro et al., 2013), all reads aligning to the + strand were offset by +4 bp, and all reads aligning to the – strand were offset −5 bp, since Tn5 transposase has been shown to bind as a dimer and insert two adaptors separated by 9 bp (Adey et al., 2010).
I face the same issue with peak calling by MACS3 after shifting the reads: https://github.com/macs3-project/MACS/issues/145 as -f BAMPE and -f BAM have different behavior.
For a standard peaks input of HINT-ATAC,which of the following 1 or 2 is the more expected input: 1. Peak calling on ATAC-seq (paired-end mode): $ macs3 callpeak -f BAMPE -t ATAC.bam -g hs -n test -B -q 0.01 2. Peak calling on ATAC-seq ( focusing on insertion sites, and using single-end mode): $ macs3 callpeak -f BAM -t ATAC.bam -g hs -n test -B -q 0.01 --shift -50 --extension 100