Genomicus / bedtools

Automatically exported from code.google.com/p/bedtools
0 stars 0 forks source link

overlap reported with -wo does not honor -split #165

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run intersectBed -split -bed -wao -abam test.bam -b test.gtf
with test.bam containing a spliced read (N in CIGAR) and test.gtf containing a 
feature (exon, etc.) that partly overlaps at least one exon of the read in 
test.bam, as well as at least part of the intron.

Example:
Relevant content of test.bam (shown as SAM):
4901P:09827:06876       16      chr7    52382101        33      64M196N28M ....
(Note that the actual reads maps to coordinates chr7:52382101-52382164 and
chr7:52382361-52382388; the first part has no overlap with the feature
below, the second only one nucleotide, at pos. 52382361)

line in test.gtf:
chr7    refGene exon    52382288        52382361        0.000000        -
.       gene_id "Mir5121"; transcript_id "NR_039581";

What is the expected output? What do you see instead?
The -wao option returns the number of nucleotides by which read and feature 
overlap in the last column.  With -split in effect, this should include only 
exons of the read.  In the example above, I expect "1" (one) in the last column.
Regardless of the use of -split, intersectBed returns "74" in the last column:
Result of "intersectBed -split -bed -wao -abam test.bam -b test.gtf":
chr7    52382100        52382388        4901P:09827:06876       33      -
52382100        52382388        0,0,0   2       64,28,  0,260,  chr7
refGene exon    52382288        52382361        0.000000        -       .
gene_id "Mir5121"; transcript_id "NR_039581";   74

This includes the intron-part of the read for the length of its overlap with 
the feature, even though the -split option should ignore introns (this works as 
long as there is NO overlap with any exon-part of the read).

What version of the product are you using? On what operating system?
v2.18.0-4-geea3909 and v2.17.0-123-gb0155b2
on Linux (Debian wheezy and RedHat 5)

Please provide any additional information below.
I have reported this previously in an e-mail to Aaron, so I apologize for any 
duplication, but since it is not fixed yet and I don't want it to get lost, I 
am submitting it here as well.  Thanks!

Original issue reported on code.google.com by tdanh...@gmail.com on 17 Dec 2013 at 12:13

GoogleCodeExporter commented 9 years ago
The overlap calculation not honoring -split also affects the -f option, i.e. 
the overlap is always compared to a fraction of the read including any introns, 
which can make the threshold very hard to meet in case of long introns.

Original comment by tdanh...@gmail.com on 17 Dec 2013 at 7:14

GoogleCodeExporter commented 9 years ago
Ran into the same problem. Without a fix on this, it will be really hard to use 
this for any RNA-seq data. It seems this problem has been reported for a while 
now. Could we get a fix soon?

Original comment by simonxi...@gmail.com on 9 Jan 2014 at 3:26

GoogleCodeExporter commented 9 years ago
This has been fixed - thanks!  Tested with v2.20.1

Original comment by tdanh...@gmail.com on 3 Jun 2014 at 8:24