BioJulia / Bio.jl

[DEPRECATED] Bioinformatics and Computational Biology Infrastructure for Julia
http://biojulia.dev
MIT License
261 stars 65 forks source link

Discriminate between exact Interval matches and overlaps #471

Closed phaverty closed 5 years ago

phaverty commented 7 years ago

In interval searching operations, Bioconductor treats exact matches differently than overlap matches (and the various kinds of overlap can be tested for too). Is there interest in having two kinds of overlap iterators, one that gives exact matches and one that gives overlapping matches?

dcjones commented 7 years ago

A simple way to do this is just

filter((x, y) -> first(x) == first(y) && last(x) == last(y), intersect(xs, ys))

We could add that definition to the library for convenience, but a custom "exact" iterator would be much faster, if that matters.

bicycle1885 commented 7 years ago

Adding some keyword or optional keyword to eachoverlap may be handy. There is a standard vocabulary to denote the state of two intervals (https://en.wikipedia.org/wiki/Allen%27s_interval_algebra). So, the interface may look like eachoverlap(a, b, rel=:equal).

dcjones commented 7 years ago

I'm coincidentally in need of a richer set of query operations in the project I'm working on, so I may tackle this.

My preference is for eachoverlap (and findfirst) to optionally take a predicate function that filters overlaps. This would be faster than filter(eachoverlap(...), ...), and let us capture a more relationships specific to genomic intervals, like matching/mismatch strand, or comparing specific metadata fields.

"Meets" and "precedes" are the only relationships in Allen's interval algebra that aren't a type of overlap. "Meets" is pretty easy to do by extending the query by one base. "Precedes" seems like of questionable usefulness, if you want that you can just iterate through the IntervalCollection, since it's sorted.

We can still use something like eachoverlap(a, b, rel=:equal) but as an alias for essentially eachoverlap(a, b, filter=(u,v) -> u.first == v.first && u.last == v.last).

TransGirlCodes commented 7 years ago

Should this be done with IntervalTrees.jl first and then extended in GenomicFeatures.jl since it defines the abstract interval and the basic IntervalTree collection? This could be a minor version number increment for both of those.

dcjones commented 7 years ago

Yeah, most the work needs to done on IntervalTrees.jl, so I'll focus on that first.

bicycle1885 commented 7 years ago

Yeah, filter function approach sounds better.

phaverty commented 7 years ago

This is great! I'm looking forward to trying it out.