BioJulia / IntervalTrees.jl

A data structure for efficient manipulation of sets of intervals
MIT License
44 stars 17 forks source link

AbstractIntervalTree #10

Open phaverty opened 9 years ago

phaverty commented 9 years ago

Would you be open to having an AbstractIntervalTree? I'd like to experiment with a GenomicRanges type and I'd like to inherit from AbstractIntervalTree.

dcjones commented 9 years ago

I'm open to this. Is there a common interface for children of AbstractIntervalTree that you have in mind?

phaverty commented 9 years ago

I'd like to see it implement the Set API for exact matches and also the ranges operations from IRanges. I'm happy to submit a PR with the full set, but this includes an Interval API that parallels the Set API with overlap replacing equality. It also contains things like "nearest".

IRanges etc. also implement the Vector API (length, setindex, getindex, etc.). The Vector part might not be what you had in mind. IRanges has both a linear and a tree representation, which has some convenience/complexity trade-offs. Importantly, it lets you map specific ranges to rows of a DataFrame.

I submitted a few tiny PRs to IntervalTree that fill in what I thought was missing functionality. I hope you find them useful.

Pete


Peter M. Haverty, Ph.D. Genentech, Inc. phaverty@gene.com

On Tue, Jun 9, 2015 at 7:03 AM, Daniel C. Jones notifications@github.com wrote:

I'm open to this. Is there a common interface for children of AbstractIntervalTree that you have in mind?

Reply to this email directly or view it on GitHub https://github.com/BioJulia/IntervalTrees.jl/issues/10#issuecomment-110434764 .

phaverty commented 9 years ago

, but for now, I'd be stoked just to have the AbstractIntervalTree type and to have the current IntervalTree methods switched to AbstractIntervalTree.

Pete


Peter M. Haverty, Ph.D. Genentech, Inc. phaverty@gene.com

On Tue, Jun 9, 2015 at 11:13 AM, Peter Haverty phaverty@gene.com wrote:

I'd like to see it implement the Set API for exact matches and also the ranges operations from IRanges. I'm happy to submit a PR with the full set, but this includes an Interval API that parallels the Set API with overlap replacing equality. It also contains things like "nearest".

IRanges etc. also implement the Vector API (length, setindex, getindex, etc.). The Vector part might not be what you had in mind. IRanges has both a linear and a tree representation, which has some convenience/complexity trade-offs. Importantly, it lets you map specific ranges to rows of a DataFrame.

I submitted a few tiny PRs to IntervalTree that fill in what I thought was missing functionality. I hope you find them useful.

Pete


Peter M. Haverty, Ph.D. Genentech, Inc. phaverty@gene.com

On Tue, Jun 9, 2015 at 7:03 AM, Daniel C. Jones notifications@github.com wrote:

I'm open to this. Is there a common interface for children of AbstractIntervalTree that you have in mind?

Reply to this email directly or view it on GitHub https://github.com/BioJulia/IntervalTrees.jl/issues/10#issuecomment-110434764 .

dcjones commented 9 years ago

Sorry I missed your PRs. They're merged now.

A problem with implementing a Set API is that the implementation here is not a set. It's perfectly valid to insert two identical intervals. We could have getindex just return the first match in such a case, or it could return an iterator over all matching intervals.

But I wonder if a more fundamental operation isn't intersect(target::AbstractIntervalTree, query::AbstractInterval) returning an array or iterator of intervals in target that intersect query.

We might also call the base class AbstractIntervalCollection or something, since it need not be a tree.

phaverty commented 9 years ago

No worries on the PRs. You are welcome to ignore the ones you don't like, of course. It's your show. Thanks for merging them.

Good point about the AbstractIntervalCollection. I've been mulling over a GenomePosition type that would just be a vector with some metadata, so that would fit your new supertype.

For GenomePosition (and GenomicRanges) I was thinking of converting all the positions to genome position by adding an offset into the concatenated, linear genome. That would give a single tree for easy intersection, sorting, etc.. I'd keep the offsets in the object for converting back to chr/start/stop coordinates, but I don't think one would want to do that often, actually. I use this linear genome position thing for plotting in the genoset package. The genome offset metadata would be immutable, and intersect operations would require two object to have identical genome metadata.

Yes, I agree that the intersect operation is the key operation for this family of types.

Pete


Peter M. Haverty, Ph.D. Genentech, Inc. phaverty@gene.com

On Wed, Jun 10, 2015 at 7:40 AM, Daniel C. Jones notifications@github.com wrote:

Sorry I missed your PRs. They're merged now.

A problem with implementing a Set API is that the implementation here is not a set. It's perfectly valid to insert two identical intervals. We could have getindex just return the first match in such a case, or it could return an iterator over all matching intervals.

But I wonder if a more fundamental operation isn't intersect(target::AbstractIntervalTree, query::AbstractInterval) returning an array or iterator of intervals in target that intersect query.

We might also call the base class AbstractIntervalCollection or something, since it need not be a tree.

Reply to this email directly or view it on GitHub https://github.com/BioJulia/IntervalTrees.jl/issues/10#issuecomment-110847857 .

blahah commented 9 years ago

Just as an aside, while @dcjones has done all the (phenomenal) work on this repo, this is technically nobody's show. BioJulia is a community endeavour and we would love to have you on board @phaverty.

phaverty commented 7 years ago

I have released GenomicVectors which implements my idea for storing ranges along the linearized genome.