arq5x / bedtools

A powerful toolset for genome arithmetic.
http://code.google.com/p/bedtools/
GNU General Public License v2.0
140 stars 85 forks source link

inaccurate counts with bedtools intersect -split and -f #148

Closed benslack19 closed 5 years ago

benslack19 commented 5 years ago

I'm trying to do something that I thought was straightforward but is not giving me the expected answer using bedtools intersect. I had written this to the Google forum (yet to be approved by the moderator) and therefore I kept working on it. As such, I've found a workaround solution which I share at the bottom.

The task I'm attempting is "How many reads from my RNA-seq sample overlap with any exon from my GTF file, given that overlaps are at least 50% of the read's length?" The majority of our read lengths are between 73 and 76 (one read of a paired-end sequencing is trimmed) so I expected overlaps of at least 37 bp to be counted.

From bedtools documentation, one way I believed to perform the the call was the following:

$BEDTOOLS intersect -abam ${STAR_BAM_PA} -b ${GTF_REF_EXON} -bed -u -f 0.5 -split | wc -l

The bedtools version I'm using is 2.26 (but I get the same results with 2.27). STAR_BAM_PA is my primary alignments BAM file using STAR then samtools. GTF_REF_EXON refers to the genomic coordinates file from NCBI RefSeq with only the lines containing "exon".

Not surprisingly, removing the -f parameter (so that the default is min. 1 bp overlap) increased this number which was expected. In order to visualize what was actually going on, I removed the wc -l and focused on one transcript and the reads that would align there. To my surprise, I found that when using the -f parameter that all of the reads were within an exon and ignored those that spanned exon-exon boundaries.

overlap_50_1bp

As you can see, the reads without the -f parameter showed reads spanning exon-exon junctions, several of which one end was spanning an exon with I thought using -split would take care of this as it is from RNA-seq. The image is the same as above but zooming in between exon 1 and the start of exon 2.

overlap_50_1bp_exon1

I tried the following:

These all did not yield the desired results.

My interpretation of this is that the -f parameter isn't working on the length of the -a input file has the documentation states (since the STAR_BAM_PA file appears to show the original read lengths), but rather on the resulting alignment which is a partial output of the bedtools intersect call. This misses many reads that span exon-exon junctions. This appears to be a bug to me.

My workaround is the following:

$BEDTOOLS intersect -abam STAR_BAM_PA -b GTF_REF_EXON -bed -split \ | cut -f 1,2,3,4,5 \ | awk '$3-$2 > 36.5’ \ | awk ' { print $2, $3, $3-$2, $4}’ \ | cut -d ' ' -f 4 \ | uniq \ | wc -l

Line 1: The output of the bedtools intersect call with -split shows the overlaps to an exon. Line 2 and 3: I filter the results to have only overlaps greater than 36.5 (since most read lengths are at least 73). Lines 4-6: I count only the reads once if they have this overlap.

Of course it'd be nice if something can be recommended (or fixed) to use in bedtools where I don't have to do this. (It's more accurate than the bedtools only call but it requires hard-coding of the read length filter and other testing shows it stops with very large BAM files.) I look forward to feedback.

benslack19 commented 5 years ago

I realized I should put this on the bedtools2 github page so I'm closing here and opening there.