I'm trying to do something that I thought was straightforward but is not giving me the expected answer using bedtools intersect. I had written this to the Google forum (yet to be approved by the moderator) and therefore I kept working on it. As such, I've found a workaround solution which I share at the bottom.
The task I'm attempting is "How many reads from my RNA-seq sample overlap with any exon from my GTF file, given that overlaps are at least 50% of the read's length?" The majority of our read lengths are between 73 and 76 (one read of a paired-end sequencing is trimmed) so I expected overlaps of at least 37 bp to be counted.
From bedtools documentation, one way I believed to perform the the call was the following:
The bedtools version I'm using is 2.26 (but I get the same results with 2.27). STAR_BAM_PA is my primary alignments BAM file using STAR then samtools. GTF_REF_EXON refers to the genomic coordinates file from NCBI RefSeq with only the lines containing "exon".
Not surprisingly, removing the -f parameter (so that the default is min. 1 bp overlap) increased this number which was expected. In order to visualize what was actually going on, I removed the wc -l and focused on one transcript and the reads that would align there. To my surprise, I found that when using the -f parameter that all of the reads were within an exon and ignored those that spanned exon-exon boundaries.
As you can see, the reads without the -f parameter showed reads spanning exon-exon junctions, several of which one end was spanning an exon with I thought using -split would take care of this as it is from RNA-seq. The image is the same as above but zooming in between exon 1 and the start of exon 2.
I tried the following:
both with and without –split option (results were the same)
inputting all 24 permutations of the four optional parameters (-bed -f 0.5 -u -split)
removing the -u
These all did not yield the desired results.
My interpretation of this is that the -f parameter isn't working on the length of the -a input file has the documentation states (since the STAR_BAM_PA file appears to show the original read lengths), but rather on the resulting alignment which is a partial output of the bedtools intersect call. This misses many reads that span exon-exon junctions. This appears to be a bug to me.
Line 1: The output of the bedtools intersect call with -split shows the overlaps to an exon.
Line 2 and 3: I filter the results to have only overlaps greater than 36.5 (since most read lengths are at least 73).
Lines 4-6: I count only the reads once if they have this overlap.
Of course it'd be nice if something can be recommended (or fixed) to use in bedtools where I don't have to do this. (It's more accurate than the bedtools only call but it requires hard-coding of the read length filter and other testing shows it stops with very large BAM files.) I look forward to feedback.
I'm trying to do something that I thought was straightforward but is not giving me the expected answer using
bedtools intersect
. I had written this to the Google forum (yet to be approved by the moderator) and therefore I kept working on it. As such, I've found a workaround solution which I share at the bottom.The task I'm attempting is "How many reads from my RNA-seq sample overlap with any exon from my GTF file, given that overlaps are at least 50% of the read's length?" The majority of our read lengths are between 73 and 76 (one read of a paired-end sequencing is trimmed) so I expected overlaps of at least 37 bp to be counted.
From bedtools documentation, one way I believed to perform the the call was the following:
$BEDTOOLS intersect -abam ${STAR_BAM_PA} -b ${GTF_REF_EXON} -bed -u -f 0.5 -split | wc -l
The
bedtools
version I'm using is 2.26 (but I get the same results with 2.27).STAR_BAM_PA
is my primary alignments BAM file usingSTAR
thensamtools
.GTF_REF_EXON
refers to the genomic coordinates file from NCBI RefSeq with only the lines containing "exon".Not surprisingly, removing the
-f
parameter (so that the default is min. 1 bp overlap) increased this number which was expected. In order to visualize what was actually going on, I removed thewc -l
and focused on one transcript and the reads that would align there. To my surprise, I found that when using the-f
parameter that all of the reads were within an exon and ignored those that spanned exon-exon boundaries.As you can see, the reads without the -f parameter showed reads spanning exon-exon junctions, several of which one end was spanning an exon with I thought using
-split
would take care of this as it is from RNA-seq. The image is the same as above but zooming in between exon 1 and the start of exon 2.I tried the following:
–split
option (results were the same)-bed -f 0.5 -u -split
)-u
These all did not yield the desired results.
My interpretation of this is that the
-f
parameter isn't working on the length of the -a input file has the documentation states (since theSTAR_BAM_PA
file appears to show the original read lengths), but rather on the resulting alignment which is a partial output of thebedtools intersect
call. This misses many reads that span exon-exon junctions. This appears to be a bug to me.My workaround is the following:
$BEDTOOLS intersect -abam STAR_BAM_PA -b GTF_REF_EXON -bed -split \
| cut -f 1,2,3,4,5 \
| awk '$3-$2 > 36.5’ \
| awk ' { print $2, $3, $3-$2, $4}’ \
| cut -d ' ' -f 4 \
| uniq \
| wc -l
Line 1: The output of the bedtools intersect call with -split shows the overlaps to an exon. Line 2 and 3: I filter the results to have only overlaps greater than 36.5 (since most read lengths are at least 73). Lines 4-6: I count only the reads once if they have this overlap.
Of course it'd be nice if something can be recommended (or fixed) to use in bedtools where I don't have to do this. (It's more accurate than the
bedtools
only call but it requires hard-coding of the read length filter and other testing shows it stops with very large BAM files.) I look forward to feedback.