deeptools / deepTools

Tools to process and analyze deep sequencing data.
Other
675 stars 208 forks source link

extendReads not properly extending short insert paired reads ... #1070

Open benbfly opened 3 years ago

benbfly commented 3 years ago

I am using deeptools bamCoverage to analyze short (35-80bp) paired end fragments. All reads in my bam file are properly paired based on samtools flagstat and Picard. Yet bamCoverage extends some and does not extend others. Here is the command:

bamCoverage -p 2 --extendReads --ignoreDuplicates --minFragmentLength 35 --maxFragmentLength 80 --binSize 1 -b seqs.hg38.bam --samFlagExclude 256 -r chr3:37135538:37164618 -of bedgraph -o seqs.hg38_002.bedgraph

In the attached screenshot, the read pairs with inserts of 35-80bp are shown in gray (there are 3). The first one (insert size=68) gets properly extended to the actual fragment size. But the other two do not, they get extended to the auto-calculated fragment average which includes the longer fragments (defaultFragmentLength=162bp). I can not figure out what is wrong with these that they are not recognized as properly paired. I attached the SAM lines for these 3 reads pairs below.

I really need to get this to work! I am only showing a small region here, but genome-wide about half of the short pairs are properly extended and half are not.

Thanks! - Ben.

version deeptools 3.5.0

problem problem.hg38.sam.txt

dpryan79 commented 3 years ago

Can you post the SAM entries for those 6 reads?

benbfly commented 3 years ago

Can you post the SAM entries for those 6 reads?

I attached them at the bottom of the original message, file called "problem.hg38.sam.txt"

dpryan79 commented 3 years ago

Ah, interesting. I wonder if this is due to the reverse strand reads starting before the forward strand reads in those two cases. Dovetailing is biological impossibility in Illumina sequencing and I'm really not sure how that should be handled, since it suggests that something went very odd during preprocessing. Did you trim the 5' end of the reads?

dpryan79 commented 3 years ago

Can you let me know what version of pysam you have installed? The behavior I expect to see in the code is that only the overlapping portion of the reads receives coverage when they dovetail like this. Since that's not happening, it seems like pysam is telling deepTools that the reads are not properly paired, which is odd, but would lead to the coverage track that you're seeing.

benbfly commented 3 years ago

I believe Pysam is 0.15.3

Regarding the origin of these, it could have to do with a problem with the trimming (in this case soft clipped by DRAGEN aligner). It could also have do with the fact that we are using single-stranded libraries, which may have a few overhang bases that are not being handled correctly.

But it sounds like these should pass in theory with the newest Pysam (looking now to check if there's a Docker version of deeptools :)

Let me know if you want me to send you a small BAM for this region. I thought you could probably replicate with the 5 SAM lines.

dpryan79 commented 3 years ago

These lines should suffice. There is a docker container from the biocontainer project for each version of deepTools (they're auto-generated by bioconda).

dpryan79 commented 3 years ago

OK, I've found the cause. Internally we do some sanity checking rather than blindly believing the flags set by the aligner. In this case the dovetailing internally changes the "proper pair" status to "false". In general, this the correct thing to do, so I don't want to change it. In this particular case, however, you can just comment out lines 780-785 of countReadsPerBin.py (remove the .pyc file of the same name in __pycache__/). That will prevent this from happening.

benbfly commented 3 years ago

I am going to look into why our DRAGEN alignments are clipped like this, and I will also look at BWA alignments of the same reads, or if it's due to the specific library kit we use (Claret SRSLY). These short reads can be super important when mapping high-resolution TFBSs and nucleosome positions, especially in assays like ATAC-seq, DNase-seq, and cfDNA. I know that in our application it's adding a lot of noise to our alignments, and I imagine if it's affecting us then it may be affecting others as well.

I will report back what I find, and if it is also a problem in other aligners, maybe we can try to address it. These are considered properly paired not only by PySam, but also by samtools C version. I think intuitively it's correct to consider it properly paired even if 1 or 2 bases are missing from the end. With the multitude of different library kits, I think all kinds of things happen to the very end of a read.

dpryan79 commented 3 years ago

Samtools and pysam simply report what flags are set, they don't do any sanity checking. Usually it's the 3' end of reads that are a bit wonky, not the 5' end.

benbfly commented 3 years ago

My mistake on the properly paired flag, I remembered that wrong.

People do hard clipping for a variety of library-specific reasons. For instance, Swift makes the most popular single-stranded kits for various applications. Their technology introduces a variably sized (8-15bp) low complexity sequence at the beginning of Read2 (https://swiftbiosci.com/wp-content/uploads/2019/02/16-0853-Tail-Trim-Final-442019.pdf). For this reason nf-core workflows that support this kit automatically hard trim 15bp from 5' end of R2 (https://nf-co.re/methylseq/1.6.1/parameters#special-library-types.)

You call this sanity checking, but it can actually cause major differences these kinds of applications. In my example fragments, the two ends overlap by over 95% of their length. In cases like this, I think it's probably better to trust the aligner about properly paired status rather than overruling it and incorrectly extending the read. This is just the application I am most familiar with, but I think people use hard clipping in a variety of contexts. I'm sure the read extension usually goes unnoticed by Deeptools users, who don't often go inspecting how their reads are being extended.

I am sure I can find a workaround, but I think this is something that should be considered for changing at some point (with a conservative rule like if the two ends overlap by >95% then respect the aligner's "properly paired" setting).