BioJulia / IntervalTrees.jl

A data structure for efficient manipulation of sets of intervals
MIT License
44 stars 17 forks source link

Chromosome and strand? #9

Closed phaverty closed 9 years ago

phaverty commented 9 years ago

Have you given some thought to how you would handle chromosome and strand? A 4-tuple ChrRange type (chr,start,end,strand)? Maybe sort with a special Ordering that will give the expected chromosome order (1-22, MT, X, Y)? Alternatively, you might associate a Vector of chromosome names, like GenomicRanges does. Having a list of IRanges, like in RangedData, seemed like a good idea, but people really want arbitrary ordering without blocking by chromosome.

Opinions vary about how to handle strand for sorting, intersecting, etc., so you may want to read the Bioconductor mailing list archives to take the community's pulse on that.

It might also be nice to have a ChrPos represented by a 2- or 3- tuple (chr, start) or (chr,start,strand).

dcjones commented 9 years ago

I have given some thought to a lot of this. IntervalTrees was intended as a lower level data structure, which we used to build IntervalCollection in Bio.jl, which supports chromosome and strand.

We do have a special sort function to put chromosome names in a natural order, which IntervalCollection does by default. We should also support arbitrary ordering of chromosome names, but that's not quite implemented (or maybe just not merged yet, I don't remember).

We are currently not sorting on stand in anyway. Maybe there's a reason to do so, I'm not sure. For intersection, strand is not considered, but intersect returns an iterator, so it's easy to say

filter((x,y) -> x.strand == y.strand, intersect(xs, ys))

We may end up adding a case for strand, if speed becomes a concern with that approach though.

I don't think we've solved everything with IntervalCollection yet, but definitely check that out and see what's lacking. Everyone has different use cases, so it'd be quite helpful to get input on it's design. IntervalTrees on the other hand is pretty much feature complete as far as it's intended purpose goes.

phaverty commented 9 years ago

Thanks for the quick reply. I see that I overlooked a lot of relevant code in Bio.jl. I'm eager to check it out. My favored design for something like this would be a subclass of DataFrame with a GRanges as a fancy index. The current Bioconductor GRanges is a bit schizophrenic as it wants to be a vector and also a DataFrame (1-d and also 2-d). I see from your diagram that you have linked lists attached to each IntervalTree node. Do you intend for each node to have the same set of things in its metadata? It might be simpler/faster to keep the metadata in a separate 2-d data structure, if so.

Thanks for the pointer to IntervalCollection and keep up the good work. R's GenomicRanges is the workhorse object in our department. I'm glad to see Julia is getting a solid implementation of a similar type.

Regards,

Pete


Peter M. Haverty, Ph.D. Genentech, Inc. phaverty@gene.com

On Fri, Jun 5, 2015 at 10:46 AM, Daniel C. Jones notifications@github.com wrote:

I have given some thought to a lot of this. IntervalTrees was intended as a lower level data structure, which we used to build IntervalCollection https://github.com/BioJulia/Bio.jl/blob/master/src/intervals/intervalcollection.jl in Bio.jl, which supports chromosome and strand.

We do have a special sort function https://github.com/BioJulia/Bio.jl/blob/master/src/intervals/interval.jl#L111 to put chromosome names in a natural order, which IntervalCollection does by default. We should also support arbitrary ordering of chromosome names, but that's not quite implemented (or maybe just not merged yet, I don't remember).

We are currently not sorting on stand in anyway. Maybe there's a reason to do so, I'm not sure. For intersection, strand is not considered, but intersect returns an iterator, so it's easy to say

filter((x,y) -> x.strand == y.strand, intersect(xs, ys))

We may end up adding a case for strand, if speed becomes a concern with that approach though.

I don't think we've solved everything with IntervalCollection yet, but definitely check that out and see what's lacking. Everyone has different use cases, so it'd be quite helpful to get input on it's design. IntervalTrees on the other hand is pretty much feature complete as far as it's intended purpose goes.

Reply to this email directly or view it on GitHub https://github.com/BioJulia/IntervalTrees.jl/issues/9#issuecomment-109377107 .

dcjones commented 9 years ago

Making IntervalCollection work like a DataFrame is an intriguing idea. I don't know in practical terms if it can share much code with DataFrames, but the familiar interface could help a lot. I've only used GenomicRanges a little. My superficial impression was that it's very powerful, but a little hard to one's my head around at first, so I'm eager to learn more.

That diagram is actually a little out of date. I ended up abandoning the linked list of metadata because it did end up being too slow in some tests. It's still the case that every interval has the same metadata type, but now it's stored directly in the IntervalTree, which I changed to allow multiple identical intervals with different metadata. That way it has most of the benefits of what you suggest with the metadata stored in blocks of contiguous memory in the B-tree leaf nodes.