bedCoverage and spliced reads

Distue commented 8 years ago

Dear authors and community,

recently I encountered a strange behaviour of bedCoverage which leads to very confusing results and I hope you can address this issue.

bedCoverage provides a function for counting reads (.bam file) in an interval (.bed file). I am using this to count in various specific regions in RNASeq data (parameters -counts for counts only and -s for the strand-specific setup we have, to mention it is paired-end).

I encountered strange counts when checking the consistency of the counting. For example in this interval in IGV browser (v2.3.67) there is only one read, however bedCoverage (v2.17 as well as v2.25) reports 6 reads in this interval.

The same interval in GenomeBrowse (v.2.1.0) shows one read and 4 spliced alignments, which does not add up to 6, however gave a hint where this counts might come from.

When converting the bam to bed(6) and then extracting the alignments spanning over the interval, the 5 missing reads show up.

perl -lane 'print join("\t", @F) if $F[0] == 1 && $F[1] < 911879 && $F[2] > 912004' RNASeq.bed

5 spliced + 1 mapped = 6

Same thing I saw in different scenarios.

The conclusion:

bedCoverage counts reads which do not map even close to the interval. The alignment split spans over 200kb in this cases. Therefore intervals with no reads get an artificial signal. The interval between spliced reads is not supposed to be in a fragment as well.

Is this a bug, a feature or did I overlook some important detail?

Thanks!

dpryan79 commented 8 years ago

The -split option should have taken care of this in earlier versions. However, 2.25.0 seems to produce incorrect results even when -split is used. Here's a relevant post on biostars with an example that can be downloaded for testing.

arq5x commented 8 years ago

Thanks for reporting - I just pushed a fix for this and will respond on biostars.org as well

dpryan79 commented 8 years ago

:+1: Thanks!

arq5x / bedtools2

bedCoverage and spliced reads #354