BioJulia / IntervalTrees.jl

A data structure for efficient manipulation of sets of intervals
MIT License
44 stars 17 forks source link

Changes pertaining to supporting 0.7 #37

Closed sambitdash closed 6 years ago

codecov-io commented 6 years ago

Codecov Report

Merging #37 into version-1.0 will increase coverage by 0.56%. The diff coverage is 93.38%.

Impacted file tree graph

@@               Coverage Diff               @@
##           version-1.0      #37      +/-   ##
===============================================
+ Coverage        93.01%   93.57%   +0.56%     
===============================================
  Files                3        3              
  Lines              787      794       +7     
===============================================
+ Hits               732      743      +11     
+ Misses              55       51       -4
Impacted Files Coverage Δ
src/slice.jl 93.22% <100%> (ø) :arrow_up:
src/map.jl 77.19% <80%> (ø) :arrow_up:
src/IntervalTrees.jl 94.98% <94.3%> (+0.64%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 82bcd10...bbdd40b. Read the comment docs.

ScottPJones commented 6 years ago

The changes for dealing with the isequal/==/>= etc. changes seem fine to me. I sympathize, I myself am having to deal with the changes to === for strings.

sambitdash commented 6 years ago

@ScottPJones I have no concerns as I know the current workarounds will work for now. But architectural decisions have to be taken for #36 by the maintainers on whatever may be best for the 3 types:

AbstractInterval{K}, Interval{K}, IntervalValue{K, V} in terms of the == and hash operator if Dict datastructure needs to be used.

sambitdash commented 6 years ago

The MacOS X build has failed due to a networking error not a real failure due to code changes.

ScottPJones commented 6 years ago

Looks good to me now - it should probably have the commits squashed (although I believe they can be squashed during the merge process)

sambitdash commented 6 years ago

It's just a button click while merging :-).

On Fri, Apr 6, 2018, 8:27 PM Scott P. Jones notifications@github.com wrote:

Looks good to me now - it should probably have the commits squashed (although I believe they can be squashed during the merge process)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/BioJulia/IntervalTrees.jl/pull/37#issuecomment-379279507, or mute the thread https://github.com/notifications/unsubscribe-auth/AbIpLVC8wfUaDY0eFt_af6ynzIMwgmiIks5tl4JOgaJpZM4THyhm .

sambitdash commented 6 years ago

@ScottPJones please review https://github.com/BioJulia/IntervalTrees.jl/issues/36#issuecomment-379425432 as well. Although this can be looked at as part of handling #36.

bicycle1885 commented 6 years ago

The overall changes look reasonable to me. However, in my quick benchmarks, I found significant performance degradation (~4-5x slower) in this branch. This may be due to the change of using Union{Nothing,T} instead of Nullable{T} in some data fields. We still need to diagnose this problem more carefully.

master (4e609d30911a0658ae743d0e6753b168b9aee805) (Julia 0.6):

Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
random query
  0.471834 seconds (1.71 M allocations: 56.540 MiB, 2.87% gc time)
  0.227507 seconds (1.56 M allocations: 47.478 MiB, 5.08% gc time)
  0.222676 seconds (1.56 M allocations: 47.478 MiB, 0.82% gc time)
  0.220556 seconds (1.56 M allocations: 47.478 MiB, 0.74% gc time)
  0.211471 seconds (1.56 M allocations: 47.478 MiB, 1.78% gc time)

this pull request (on Julia 0.7-dev):

Julia Version 0.7.0-DEV.4810
Commit a96d847592* (2018-04-07 21:52 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.7.0)
  CPU: Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
Environment:
  JULIA_SHELL = /bin/bash
random query
  1.570659 seconds (16.18 M allocations: 507.332 MiB, 6.57% gc time)
  1.033488 seconds (15.53 M allocations: 467.438 MiB, 1.31% gc time)
  1.016930 seconds (15.53 M allocations: 467.438 MiB, 1.12% gc time)
  1.015666 seconds (15.53 M allocations: 467.438 MiB, 1.18% gc time)
  1.014910 seconds (15.53 M allocations: 467.438 MiB, 1.20% gc time)

benchmark.jl:

if VERSION > v"0.7-"
    push!(LOAD_PATH, Base.NamedEnv("v0.7"))
    using InteractiveUtils
    using Random
    using Profile
end
versioninfo()

using IntervalTrees
using CodecZlib

function readgff3(input::IO, chrom::String)
    intervals = Interval{Int}[]
    for line in eachline(input)
        if startswith(line, "#")
            continue
        end
        values = split(line, '\t')
        if values[1] == chrom
            chromstart = parse(Int, values[4])
            chromend = parse(Int, values[5])
            push!(intervals, Interval(chromstart, chromend))
        end
    end
    sort!(intervals)
    return IntervalTree{Int,Interval{Int}}(intervals)
end

function readgff3(filepath::String, chrom::String)
    stream = open(filepath)
    if endswith(filepath, ".gz")
        stream = GzipDecompressorStream(stream)
    end
    try
        return readgff3(stream, chrom)
    finally
        close(stream)
    end
end

# ftp://ftp.ensembl.org/pub/release-92/gff3/danio_rerio/Danio_rerio.GRCz11.92.gff3.gz
tree = readgff3("data/Danio_rerio.GRCz11.92.gff3.gz", "1")

println("random query")
function random_query(tree, queries)
    n = 0
    for q in queries
        for _ in intersect(tree, q)
            n += 1
        end
    end
    return n
end
srand(1234)
intervals = shuffle!(collect(tree))
for _ in 1:5
    @time random_query(tree, intervals)
end
if "-p" in ARGS
    @profile random_query(tree, intervals)
    Profile.print()
end
sambitdash commented 6 years ago

Can you remove the fast path once and try? This is at a few places.

     if isbits(T)

I felt that was doing a lot number of allocations earlier.

bicycle1885 commented 6 years ago

No, it has no effect on performance.

Julia Version 0.7.0-DEV.4810
Commit a96d847592* (2018-04-07 21:52 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.7.0)
  CPU: Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
Environment:
  JULIA_SHELL = /bin/bash
random query
  1.379991 seconds (16.18 M allocations: 507.394 MiB, 6.40% gc time)
  0.967503 seconds (15.53 M allocations: 467.438 MiB, 1.30% gc time)
  1.000985 seconds (15.53 M allocations: 467.438 MiB, 1.10% gc time)
  0.986747 seconds (15.53 M allocations: 467.438 MiB, 1.20% gc time)
  1.064947 seconds (15.53 M allocations: 467.438 MiB, 1.29% gc time)
diff --git a/src/slice.jl b/src/slice.jl
index 66f9703..1408fc4 100644
--- a/src/slice.jl
+++ b/src/slice.jl
@@ -74,13 +74,13 @@ function Base.insert!(s::Slice{T, N}, i::Integer, value) where {T, N}
     if s.n < N && 1 <= i <= s.n + 1
         # TODO: This should work but won't. Fix this in Julia.
         #copy!(s.data, i+1, s.data, i, s.n - i + 1)
-        if isbits(T)
-            unsafe_copyto!(pointer(s.data, i+1), pointer(s.data, i), s.n - i + 1)
-        else
+        #if isbits(T)
+        #    unsafe_copyto!(pointer(s.data, i+1), pointer(s.data, i), s.n - i + 1)
+        #else
             for k in 0:(s.n - i)
                 @inbounds s.data[s.n+1-k] = s.data[s.n-k]
             end
-        end
+        #end
         @inbounds s.data[i] = value
         s.n += 1
         return s
@@ -122,13 +122,13 @@ end

 function slice_insert!(xs::Vector{T}, count::Integer, i::Integer, value::T) where T
     if count < length(xs) && 1 <= i <= count + 1
-        if isbits(T)
-            unsafe_copyto!(pointer(xs, i+1), pointer(xs, i), count - i + 1)
-        else
+        #if isbits(T)
+        #    unsafe_copyto!(pointer(xs, i+1), pointer(xs, i), count - i + 1)
+        #else
             for k in 0:(count - i)
                 @inbounds xs[count+1-k] = xs[count-k]
             end
-        end
+        #end
         @inbounds xs[i] = value
         count += 1
         return count
sambitdash commented 6 years ago

Nullable{T} is no longer in 0.7. There does not seem to be any alternate then, unless we will like to add own version of Nullable{T}.

bicycle1885 commented 6 years ago

This performance degradation seems to be due to the failure of type inference in Julia. The following patch solves the problem.

diff --git a/src/IntervalTrees.jl b/src/IntervalTrees.jl
index e2e148a..05c6776 100644
--- a/src/IntervalTrees.jl
+++ b/src/IntervalTrees.jl
@@ -1339,7 +1339,7 @@ function nextintersection!(t::LeafNode{K, V, B}, i::Integer,

         t.right === nothing && break

-        t = t.right
+        t = t.right::LeafNode{K,V,B}

         if minstart(t) > last(query)
             break

9445c93f9754062dfa1f7944dff3e61b9c3d4eb0 + patch:

Julia Version 0.7.0-DEV.4810
Commit a96d847592* (2018-04-07 21:52 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.7.0)
  CPU: Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
Environment:
  JULIA_SHELL = /bin/bash
┌ Warning: `trunc(x::Number, digits)` is deprecated, use `trunc(x; digits=digits)` instead.
│   caller = stale_cachefile(::String, ::String) at loading.jl:1391
└ @ Base loading.jl:1391
random query
  0.747452 seconds (2.00 M allocations: 80.573 MiB, 10.57% gc time)
  0.151449 seconds (1.40 M allocations: 42.690 MiB, 3.20% gc time)
  0.129592 seconds (1.40 M allocations: 42.690 MiB, 1.51% gc time)
  0.128382 seconds (1.40 M allocations: 42.690 MiB, 0.94% gc time)
  0.132978 seconds (1.40 M allocations: 42.690 MiB, 0.89% gc time)
ScottPJones commented 6 years ago

That's very interesting. That may in part be why I saw such poor performance with the changes to the find* functions to use nothing as a sentinel value.

bicycle1885 commented 6 years ago

The type of t is apparent to human beings, but it may not be so to the compiler. This may be a bug of Julia. Adding type annotations is enough now as a workaround.

I think this kind of problems is here and there. We need some benchmark suite to detect them.

sambitdash commented 6 years ago

Or may be just leave the PR for now and have Julia get stable.

sambitdash commented 6 years ago

After adding notnothing to where relevant.

julia> include("benchmark.jl")
Julia Version 0.7.0-DEV.4736
Commit 8131c75a91 (2018-04-01 20:23 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
Environment:
random query
  0.529148 seconds (2.63 M allocations: 97.357 MiB, 14.03% gc time)
  0.138258 seconds (2.07 M allocations: 63.062 MiB, 4.07% gc time)
  0.135348 seconds (2.07 M allocations: 63.062 MiB, 3.61% gc time)
  0.136543 seconds (2.07 M allocations: 63.062 MiB, 3.22% gc time)
  0.140009 seconds (2.07 M allocations: 63.062 MiB, 3.27% gc time)

Now everything is fixed.

sambitdash commented 6 years ago

@ScottPJones, @bicycle1885 re-editing this note as I have some additional observations. notnothing is eliminating the type instability just that @code_warntype is highlighting in red. (Line 41 below in bold) But that code will never be reached. I like notnothing over a variable assignment as it's like a compiler directive.

But if you look at all variables there is no overloaded types. So if you agree, we could go ahead with the existing code. Assigning to temporary variable will be hard to maintain in the long run unless it will be fixed by the compiler later.

julia> function rightmost(t::Node)
           while true
               if t.right == nothing
                   break
               end
               t = Base.notnothing(t.right)
           end
           return t
       end
rightmost (generic function with 1 method)

julia> @code_warntype rightmost(node)
Variables:
  t@_2::Node{Int64}
  t@_3::Node{Int64}
  #temp#@_4::Bool
  #temp#@_5::Node{Int64}

Body:
  begin
      t@_3::Node{Int64} = t@_2::Node{Int64}
      2: 
      #= line 3 =#
      # meta: location sysimg.jl getproperty 18
      Core.SSAValue(4) = (Base.getfield)(t@_3::Node{Int64}, :right)::Union{Nothing, Node{Int64}}
      # meta: pop location
      Core.SSAValue(5) = (Core.SSAValue(4) isa Nothing)::Bool
      unless Core.SSAValue(5) goto 11
      #temp#@_4::Bool = $(Expr(:invoke, MethodInstance for ==(::Nothing, ::Nothing), :(Main.:==), Core.SSAValue(4), :(Main.nothing)))::Bool
      goto 20
      11: 
      Core.SSAValue(6) = (Core.SSAValue(4) isa Node{Int64})::Bool
      unless Core.SSAValue(6) goto 16
      #temp#@_4::Bool = $(Expr(:invoke, MethodInstance for ==(::Node{Int64}, ::Nothing), :(Main.:==), Core.SSAValue(4), :(Main.nothing)))::Bool
      goto 20
      16: 
      goto 18
      18: 
      (Base.error)("fatal error in type inference (type bound)")::Union{}
      20: 
      Core.SSAValue(1) = #temp#@_4::Bool
      unless Core.SSAValue(1) goto 25
      #= line 4 =#
      goto 47
      25: 
      #= line 6 =#
      # meta: location sysimg.jl getproperty 18
      Core.SSAValue(7) = (Base.getfield)(t@_3::Node{Int64}, :right)::Union{Nothing, Node{Int64}}
      # meta: pop location
      Core.SSAValue(8) = (Core.SSAValue(7) isa Nothing)::Bool
      unless Core.SSAValue(8) goto 34
      #temp#@_5::Node{Int64} = $(Expr(:invoke, MethodInstance for notnothing(::Nothing), Base.notnothing, Core.SSAValue(7)))::Node{Int64}
      goto 43
      34: 
      Core.SSAValue(9) = (Core.SSAValue(7) isa Node{Int64})::Bool
      unless Core.SSAValue(9) goto 39
      #temp#@_5::Node{Int64} = $(Expr(:invoke, MethodInstance for notnothing(::Node{Int64}), Base.notnothing, Core.SSAValue(7)))::Node{Int64}
      goto 43
      39: 
      goto 41

41: (Base.error)("fatal error in type inference (type bound)")::Union{}

      43: 
      t@_3::Node{Int64} = #temp#@_5::Node{Int64}
      45: 
      goto 2
      47: 
      #= line 8 =#
      return t@_3::Node{Int64}
  end::Node{Int64}
sambitdash commented 6 years ago

removed last comment I added as the code was wrong.

sambitdash commented 6 years ago

@ScottPJones and @bicycle1885 Do you think we can go ahead with notnothing?

dcjones commented 6 years ago

This looks good to me.

Should we bump REQUIRE to julia 0.7? I would guess that using nothing in place of Nullable would be such a big performance regression in 0.6, that it might not be worth supporting it at all.

ScottPJones commented 6 years ago

Bump REQUIRE to v0.7, and merge this?

bicycle1885 commented 6 years ago

Agreed. The release of Julia 0.7-beta2 is almost there (maybe in a few days?), and so I think it would be better to wait for it and then bump the requirement to 0.7-beta2.

phaverty commented 6 years ago

The second beta is out :-)

bicycle1885 commented 6 years ago

Started rebuilding.

TransGirlCodes commented 6 years ago

Thanks for this @sambitdash and everyone, I'll merge this into the version-1.0 branch and update the julia REQUIRE version. I'll update the project files and add Documenter.jl docs over on that branch. Thanks for the nice work @sambitdash :+1:

sambitdash commented 6 years ago

Thanks a lot for merging the changes.