LiuLabUB / HMMRATAC

HMMRATAC peak caller for ATAC-seq data
GNU General Public License v3.0
99 stars 23 forks source link

BR: #79

Open EllieDuan opened 3 years ago

EllieDuan commented 3 years ago

Describe the bug

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 at HMMR_ATAC.Main_HMMR_Driver.main(Main_HMMR_Driver.java:425)

To Reproduce This error only show up when used --trim 3 option

System (please complete the following information):

Additional context

HagaiHargil commented 3 years ago

Hi, any ideas how can one fix it, or go around it? I'm obviously only interested in the open areas (i.e. no nucleosomes) of the chromosome I'm analyzing.

EllieDuan commented 3 years ago

Hi, any ideas how can one fix it, or go around it? I'm obviously only interested in the open areas (i.e. no nucleosomes) of the chromosome I'm analyzing.

Not really helpful here regarding HMMRATAC.

I ended up selected open regions using Samtools for insert size less than 100bp: https://accio.github.io/bioinformatics/2020/03/10/filter-bam-by-insert-size.html

Then calling peaks on these open regions bam files using Macs2.

HagaiHargil commented 3 years ago

Thanks for the reply, I'll try that out.

Why didn't you call the peaks on the new file using HMMRATAC? Is the output of your workflow incompatible with what HMMRATAC is looking for?

EvanTarbell commented 3 years ago

Sorry for the delay. Let me start by asking if this data was size selected? Ie on a gel etc.

EllieDuan commented 3 years ago

Thanks for the reply, I'll try that out.

Why didn't you call the peaks on the new file using HMMRATAC? Is the output of your workflow incompatible with what HMMRATAC is looking for?

I think HMMRATAC works better for input without size selection for their model, based on their paper: https://academic.oup.com/nar/article/47/16/e91/5519166, therefore, I decided to go back the traditional way. The bam file after Samtool size selection should be compatible with HMMRATAC.

EllieDuan commented 3 years ago

Sorry for the delay. Let me start by asking if this data was size selected? Ie on a gel etc.

The bam file I input in HMMRATAC that run into this error message was not being size selected, the library for sequencing is also not selected for insert size. It contains all short and long fragments. Thank you.

EvanTarbell commented 3 years ago

The bam file I input in HMMRATAC that run into this error message was not being size selected, the library for sequencing is also not selected for insert size. It contains all short and long fragments. Thank you.

I think there is a misunderstanding about the —trim option. It isn’t a parameter for excluding nucleosomes or isolating open regions. It is meant to allow the processing of libraries that were subjected to some size selection. So if your data wasn’t, I don’t recommend using that option. If you are only interested in open regions, and not the flanking nucleosomes, then use the coordinates at columns 7-8 of the gappedPeak file, which designates the open regions alone. But the algorithm should be run with the whole library and without —trim.

As an aside, regardless of the software, you shouldn’t do any size selection with ATAC-seq, physical or computational. We were able to show, in the HMMRATAC paper, that size selection results in a loss of precision compared to non selected data

EllieDuan commented 3 years ago

The bam file I input in HMMRATAC that run into this error message was not being size selected, the library for sequencing is also not selected for insert size. It contains all short and long fragments. Thank you.

I think there is a misunderstanding about the —trim option. It isn’t a parameter for excluding nucleosomes or isolating open regions. It is meant to allow the processing of libraries that were subjected to some size selection. So if your data wasn’t, I don’t recommend using that option. If you are only interested in open regions, and not the flanking nucleosomes, then use the coordinates at columns 7-8 of the gappedPeak file, which designates the open regions alone. But the algorithm should be run with the whole library and without —trim.

As an aside, regardless of the software, you shouldn’t do any size selection with ATAC-seq, physical or computational. We were able to show, in the HMMRATAC paper, that size selection results in a loss of precision compared to non selected data

Thank you for the clarification, this makes much sense now! Many thanks!

HagaiHargil commented 3 years ago

@EvanTarbell Thanks a lot for the info! I'll be sure to look at the gappedPeaks file. BTW, regardless, the trim option is still buggy and causing the crash @EllieDuan described here.

EvanTarbell commented 3 years ago

Yes, i understand there is an issue with —trim I’m going to reopen this issue so I can address whatever bug caused the original error

HagaiHargil commented 3 years ago

Hello again @EvanTarbell,

I just want to make sure I got your previous explanation regarding the open chromatin regions right.

What I did was to run HMMRATAC on an ENCODE ATAC-Seq dataset to see if I can re-create standard fragment length distribution plots as seen in many ATAC-Seq-centered works. The results show that the average fragament size is much larger than expected, and I hardly see any fragments with a length below 150 bp, which should point at the nucleosome-free regions.

Frag length

I'm reaching out since I'm (obviously) very new to this field and I'm not sure if I'm missing something basic. I'd be happy if you could make sure that the dataset I'm looking at should indeed produce that fragment size histogram I was referring to.

Thanks again and sorry if this is out of scope.

HagaiHargil commented 3 years ago

For future reference, I did arrive at some sort of a solution. That same BAM file already contains the correct fragment length distribution, we just have to extract it out. I found this blog post with the following intimidating one-liner:

samtools view ATAC_f2q30_sorted.bam | awk '$9>0' | cut -f 9 | sort | uniq -c | sort -b -k2,2n | sed -e 's/^[ \t]*//' > fragment_length_count.txt

This generates a text file that can be histogrammed, and the resulting histogram does indeed contain the necessary periodicity. As a caveat, I'm not sure that this is best-practice, or for that matter even correct :)