BioJulia / GenomicFeatures.jl

Tools for genomic features in Julia.
Other
32 stars 13 forks source link

Ordering of genomic intervals only works as expected when parameterised by the same type #44

Open owensnick opened 2 years ago

owensnick commented 2 years ago

Comparison of intervals isless, isordered, precedes of intervals is defined for intervals with identical type parameters for their metadata, for example:

Base.isless(a::Interval{T}, b::Interval{T}, seqname_isless::Function=isless) where T

This causes some surprises when comparing intervals with different parametric types as it falls back on basic_isless defined in IntervalTrees.jl and does not compare seqname:

## Comparing Interval{String} with Interval{String} 
julia> Interval("chr2", 1, 10, '.', "") < Interval("chr1", 11, 20, '.', "")
false  # expected
julia> @which Interval("chr2", 1, 10, '.', "") < Interval("chr1", 11, 20, '.', "")
isless(a::Interval{T}, b::Interval{T}) where T in GenomicFeatures at ... .julia/packages/GenomicFeatures/bVCwr/src/interval.jl:82

## Comparing Interval{Int64} with Interval{String} 
julia> Interval("chr2", 1, 10, '.',  1) < Interval("chr1", 11, 20, '.', "")
true # not expected, only compares first and last
julia> @which isless(Interval("chr2", 1, 10, '.', 1), Interval("chr1", 11, 20, '.', ""))
isless(u::IntervalTrees.AbstractInterval, v::IntervalTrees.AbstractInterval) in IntervalTrees at  ... .julia/packages/IntervalTrees/wh2ex/src/IntervalTrees.jl:30

It would be super useful to be able to order intervals when their metadata was of different types. The most obvious change would be to allow different type parameters: Base.isless(a::Interval{T}, b::Interval{V}, seqname_isless::Function=isless) where {T, V}

However, could this cause issues breaking ordering equality conditions, as it makes sense to define isequal for two intervals that have equal metadata. So for two intervals x and y with identical coordinates but different metadata you would have all of isless(x, y), isless(y, x) and isequal(x, y) being false.

My use case is that I want to calculate distances between intervals with different metadata and using functions like sort and searchsorted are useful in doing this.

Would be happy if isordered was changed to isordered(a::Interval{T}, b::Interval{V}, seqname_isless::Function=isless) where {T, V}, and then use: searchsorted(intervals, iv, lt=GenomicFeatures.isordered), unless anyone sees any issues in doing so?

CiaranOMara commented 2 years ago

I think I've got upstream code that addresses your use case. But let me know if it doesn't and I'll have another look at your suggestion. I've linked the relevant code below.

https://github.com/BioJulia/GenomicFeatures.jl/blob/2bad5693c4e8c09c8c5c5e8c92f7c6005c6567a1/src/interval.jl#L86-L137

If you want to test the code, add the following packages.

add https://github.com/CiaranOMara/IntervalTrees.jl#pu/v1.1.0
add GenomicFeatuers#pu/v3.0.0

The main changes are that the Interval type is renamed GenomicInterval and now subtypes AbstractGenomicInterval; and IntervalCollection is now GenomicIntervalCollection and the element type of the collection is now the AbstractGenomicInterval instead of the interval's metadata type. The rest should be familiar.

These branches are fairly stable as I'm using them for my own work. The reason they haven't been merged into master is that it doesn't seem correct/reasonable to run ahead with GenomicFeatures v3 before bring all of the other packages up to date with GenomicFeatures v2.

CiaranOMara commented 2 years ago

I've moved the relevant code down to the head of the develop branch. So the functionality you require without the new types is now also an option for you with dev GenomicFeatures.