(a+b+c) is not commutative and overestimates overlap

mlovci commented 12 years ago

I have three BED Tools, a b and c, and I expect that (a+b+c).count() should == (b+a+c).count(), since that function is intended to extract features that are common to all three BedTools. (as discussed here: http://packages.python.org/pybedtools/3-brief-examples.html#example-2-intersections-for-a-3-way-venn-diagram) However, in my tests, I find conflicting results, depending on the order of the operands. Also, the total number present in (a+b+c).count() [or any other order] seems to be an over-estimate of the actual number of overlaps, given that I find elements in (a+b+c) that are specific to only b. I can provide BED files for testing if needed.

as a consequence, "venn_mpl.py -c a.BED -a b.BED -b c.BED " gives a different result from "venn_mpl.py -a a.BED -b b.BED -c c.BED "

daler commented 12 years ago

Good point. This is something tricky that comes up a lot when you have files with multiple features in one that overlap a single feature in another and then try to make a Venn diagram -- which has no category for "multiple hits".

The reason you run into this issue is that the BedTool.__add__ method uses the u=True argument to intersectBed. (__subtract__ uses v=True to be symmetrical). If you have nested features (2 features in b that overlap a feature in a) then you'll run into this issue.

For example, what should the 2-way Venn diagram look like for these files? It's not really defined what should go in that middle overlap section of the diagram:

a.bed  -------------        --------------------------
b.bed      -----------               -----     ------

Using u=True:

$ intersectBed -a a.bed -b b.bed -u | wc -l 2

$ intersectBed -a b.bed -b a.bed -u | wc -l 3

Not using u=True results in another issue -- the total number of features for a overlapping with b is greater than the number of features in a in the first place:

$ intersectBed -a a.bed -b b.bed | wc -l 3

$ intersectBed -a b.bed -b a.bed | wc -l 3

This latter version can result in the sum of the each circle in the Venn diagram being huge and having no relation to the original number of features.

Unfortunately, Venn diagrams aren't ideal for overlapping cases like this. In my applications it made sense to use the u=True case. Do you think it makes more sense to use the u=False?

If you'd like to play around, you can subclass BedTool and overwrite its __add__ method to do what you want:

class MyBedTool(BedTool):
    def __init__(self, *args, **kwargs):
        BedTool.__init__(self, *args, **kwargs)
    def __add__(self, other):
        return self.intersect(other)

Any suggestions on how to improve this? I'd imagine others will run into this as well, so I'll at least have to add this explanation to the docs.

brentp commented 12 years ago

I don't see how to get around this issue that:

(a + b + c) != (b + c + a)

but, we could provide a class method like:


BedTool.intersect_all(*beds, **kwargs)

daler commented 12 years ago

so something like that recent post on the bedtools list:

def intersect_all(beds, **kwargs):
    """
    Successively intersect all files in `beds`, passing kwargs
    to BedTool.intersect().

    Note that if kwargs like `u=True` or `v=True` are used, the order 
    of the list will determine the final results
    """
    x = BedTool(beds[0])
    for bed in beds[1:]:
        x = x.intersect(bed, **kwargs)
    return x

brentp commented 12 years ago

Won't that have the same problem -- it'll depend on the order of the inputs?

Maybe first you need to do something smart like cat, then merge to get a base. Then intersect each successively with base?

daler commented 12 years ago

Yep, as noted in the docstring. Really it just passes the buck, leaving decisions up to the caller rather than hard-coding in the overridden __add__.

But with no kwargs, it should do the plain ol' intersect (without -u), which should be commutative. I'll have to think some more about a cat/merge/intersect strategy.

daler / pybedtools

(a+b+c) is not commutative and overestimates overlap #45