arq5x / bedtools2

bedtools - the swiss army knife for genome arithmetic
MIT License
937 stars 287 forks source link

bedtools intersect is doubleplus slow on BAM files #183

Closed csoeder closed 9 years ago

csoeder commented 9 years ago

I've been dealing with long runtimes on some code, and finally tracked the source down to bedtools behaving oddly.

I have a sorted BAM file of RNA-Seq reads (HG00097_mapsplice_alignment.sort.bam), and I'd like to gather the reads which overlap a BEDfile (merged_flanks.bed) containing exactly one interval (chr5:6732686-6737489). In addition, if a read has been spliced, I only want to keep it if an exonic part overlaps the interval, not if the intron spans it.

So:

LSBATCH: User input

time bedtools intersect -split -bed -wa -abam ../../HG00097_mapsplice_alignment.sort.bam -b merged_flanks.bed > new_reads.bed

real 26m38.142s user 6m15.475s sys 0m3.602s

25 minutes?! Removing the obsolete -abam tag improves things 5x but is still running hella slow:


LSBATCH: User input

time bedtools intersect -split -bed -wa -a ../../HG00097_mapsplice_alignment.sort.bam -b merged_flanks.bed > new_reads.bed

real 5m49.682s user 5m47.060s sys 0m2.511s

By contrast, it's much faster to glob the regions in the BED file and then use samtools to read the BAM file:


LSBATCH: User input

time samtools view -bh ../../HG00097_mapsplice_alignment.sort.bam chr5:6732686-6737489 | bedtools bamtobed -bed12 -i - | bedtools intersect -split -wa -a - -b merged_flanks.bed > timetest.bed

real 0m0.151s user 0m0.035s sys 0m0.014s

nkindlon commented 9 years ago

This was resolved in an off-line conversation. A small query with large DB is faster with samtools due to it's use of indexing. Bedtools does not use indexing because that requires I/O.