Closed mlovci closed 12 years ago
Good point. This is something tricky that comes up a lot when you have files with multiple features in one that overlap a single feature in another and then try to make a Venn diagram -- which has no category for "multiple hits".
The reason you run into this issue is that the BedTool.__add__
method uses the u=True
argument to intersectBed
. (__subtract__
uses v=True
to be symmetrical). If you have nested features (2 features in b that overlap a feature in a) then you'll run into this issue.
For example, what should the 2-way Venn diagram look like for these files? It's not really defined what should go in that middle overlap section of the diagram:
a.bed ------------- --------------------------
b.bed ----------- ----- ------
Using u=True
:
$ intersectBed -a a.bed -b b.bed -u | wc -l 2
$ intersectBed -a b.bed -b a.bed -u | wc -l 3
Not using u=True
results in another issue -- the total number of features for a
overlapping with b
is greater than the number of features in a
in the first place:
$ intersectBed -a a.bed -b b.bed | wc -l 3
$ intersectBed -a b.bed -b a.bed | wc -l 3
This latter version can result in the sum of the each circle in the Venn diagram being huge and having no relation to the original number of features.
Unfortunately, Venn diagrams aren't ideal for overlapping cases like this. In my applications it made sense to use the u=True
case. Do you think it makes more sense to use the u=False
?
If you'd like to play around, you can subclass BedTool and overwrite its __add__
method to do what you want:
class MyBedTool(BedTool):
def __init__(self, *args, **kwargs):
BedTool.__init__(self, *args, **kwargs)
def __add__(self, other):
return self.intersect(other)
Any suggestions on how to improve this? I'd imagine others will run into this as well, so I'll at least have to add this explanation to the docs.
I don't see how to get around this issue that:
(a + b + c) != (b + c + a)
but, we could provide a class method like:
BedTool.intersect_all(*beds, **kwargs)
so something like that recent post on the bedtools list:
def intersect_all(beds, **kwargs):
"""
Successively intersect all files in `beds`, passing kwargs
to BedTool.intersect().
Note that if kwargs like `u=True` or `v=True` are used, the order
of the list will determine the final results
"""
x = BedTool(beds[0])
for bed in beds[1:]:
x = x.intersect(bed, **kwargs)
return x
Won't that have the same problem -- it'll depend on the order of the inputs?
Maybe first you need to do something smart like cat, then merge to get a base. Then intersect each successively with base?
Yep, as noted in the docstring. Really it just passes the buck, leaving decisions up to the caller rather than hard-coding in the overridden __add__
.
But with no kwargs, it should do the plain ol' intersect (without -u), which should be commutative. I'll have to think some more about a cat/merge/intersect strategy.
I have three BED Tools, a b and c, and I expect that (a+b+c).count() should == (b+a+c).count(), since that function is intended to extract features that are common to all three BedTools. (as discussed here: http://packages.python.org/pybedtools/3-brief-examples.html#example-2-intersections-for-a-3-way-venn-diagram) However, in my tests, I find conflicting results, depending on the order of the operands. Also, the total number present in (a+b+c).count() [or any other order] seems to be an over-estimate of the actual number of overlaps, given that I find elements in (a+b+c) that are specific to only b. I can provide BED files for testing if needed.
as a consequence, "venn_mpl.py -c a.BED -a b.BED -b c.BED " gives a different result from "venn_mpl.py -a a.BED -b b.BED -c c.BED "