I've been dealing with long runtimes on some code, and finally tracked the source down to bedtools behaving oddly.
I have a sorted BAM file of RNA-Seq reads (HG00097_mapsplice_alignment.sort.bam), and I'd like to gather the reads which overlap a BEDfile (merged_flanks.bed) containing exactly one interval (chr5:6732686-6737489). In addition, if a read has been spliced, I only want to keep it if an exonic part overlaps the interval, not if the intron spans it.
This was resolved in an off-line conversation. A small query with large DB is faster with samtools due to it's use of indexing. Bedtools does not use indexing because that requires I/O.
I've been dealing with long runtimes on some code, and finally tracked the source down to bedtools behaving oddly.
I have a sorted BAM file of RNA-Seq reads (HG00097_mapsplice_alignment.sort.bam), and I'd like to gather the reads which overlap a BEDfile (merged_flanks.bed) containing exactly one interval (chr5:6732686-6737489). In addition, if a read has been spliced, I only want to keep it if an exonic part overlaps the interval, not if the intron spans it.
So:
LSBATCH: User input
time bedtools intersect -split -bed -wa -abam ../../HG00097_mapsplice_alignment.sort.bam -b merged_flanks.bed > new_reads.bed
real 26m38.142s user 6m15.475s sys 0m3.602s
25 minutes?! Removing the obsolete -abam tag improves things 5x but is still running hella slow:
LSBATCH: User input
time bedtools intersect -split -bed -wa -a ../../HG00097_mapsplice_alignment.sort.bam -b merged_flanks.bed > new_reads.bed
real 5m49.682s user 5m47.060s sys 0m2.511s
By contrast, it's much faster to glob the regions in the BED file and then use samtools to read the BAM file:
LSBATCH: User input
time samtools view -bh ../../HG00097_mapsplice_alignment.sort.bam chr5:6732686-6737489 | bedtools bamtobed -bed12 -i - | bedtools intersect -split -wa -a - -b merged_flanks.bed > timetest.bed
real 0m0.151s user 0m0.035s sys 0m0.014s