Closed: Tortar closed this 4 months ago.

I initially described the issue here: https://github.com/JuliaLang/julia/issues/54818, but I actually found this bottleneck in ChunkSplitters.jl when operating in multi-threading on views. I think it would be much faster in some cases if it returned a `UnitRange{Int}` instead of a `StepRange{Int, Int}` when the step is 1.
The reason for returning the range with the step is that the step might differ from 1 if the scatter option is chosen. And if we returned different range types in each case, we would introduce a type instability. We could address that by dispatching on chunk types. But do you have an example where that clearly affects performance? (I'm not sure, but the necessary changes might be breaking.)
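For context, this is the shape of the two chunkings of `1:10` with `n = 2` (illustrative values, not library output):

```julia
# :batch   -> 1:5 and 6:10        (contiguous; expressible as a UnitRange)
# :scatter -> 1:2:9 and 2:2:10    (interleaved; requires a StepRange)
```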
Here it is, starting Julia with `julia --threads=2`:
```julia
using BenchmarkTools
using ChunkSplitters
using StreamSampling

a = collect(1:10^7)
c1 = chunks(a; n = 2)                          # chunk ranges from ChunkSplitters
c2 = [(1, 1:5000000), (2, 5000001:10000000)]   # hand-written UnitRange chunks

function mwe_c1(a, c)
    s = Vector{Vector{Int}}(undef, 2)
    Threads.@threads for (i, v) in enumerate(c)
        s[i] = itsample(@view(a[v]), 1)   # draw one random sample per chunk
    end
    return s
end

function mwe_c2(a, c)
    s = Vector{Vector{Int}}(undef, 2)
    Threads.@threads for (i, v) in c
        s[i] = itsample(@view(a[v]), 1)
    end
    return s
end

@benchmark mwe_c1($a, $c1)
@benchmark mwe_c2($a, $c2)
```
I see:
```julia
julia> @benchmark mwe_c1($a, $c1)
BenchmarkTools.Trial: 591 samples with 1 evaluation.
 Range (min … max):  362.941 μs … 12.869 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       8.862 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     8.456 ms ±  2.854 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▄ ▁▂ ▁▂▄▁▂▂ ▁▄ ▆ █▄▇▂▂▄
  ▃▃▃▃▃▃▁▄▅▁▃▃▅▅▄▅▇▆▆▅▅▅▆▄█▅▄▆████▆██▆█████████████▆█████████▇ ▅
  363 μs          Histogram: frequency by time         12.5 ms <

 Memory estimate: 1.47 KiB, allocs estimate: 18.

julia> @benchmark mwe_c2($a, $c2)
BenchmarkTools.Trial: 4051 samples with 1 evaluation.
 Range (min … max):  46.218 μs …   2.398 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      1.306 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):    1.231 ms ± 431.481 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁   ▂▁▁▂▂▂▃▃▄▁▄▃▃▄▆▄▇▅▄▆▇█▇▇▇██▇▆▂
  ▂▁▂▂▂▃▃▄▄▄▄▄▅▅▄▅▆▆▅▇▇█▇▇▇▇████████████████████████████████▆▃ ▆
  46.2 μs         Histogram: frequency by time         1.89 ms <

 Memory estimate: 1.44 KiB, allocs estimate: 18.
```
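The gap is consistent with the cost of indexing a view through a `StepRange`: even when the step is 1, the stride is a runtime value, which blocks contiguous-memory optimizations. A minimal standalone sketch to check this in isolation (my addition, not part of the report above):

```julia
using BenchmarkTools

a = collect(1:10^7)

# View through a UnitRange index: known contiguous, reductions can vectorize.
@benchmark sum(view($a, 1:10^7))

# View through a StepRange index with step 1: same elements, but the stride
# is only known at runtime, so the same optimizations do not apply.
@benchmark sum(view($a, 1:1:10^7))
```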
I made an attempt to see if union splitting could solve this issue without introducing instabilities. The attempt is in this branch:
https://github.com/JuliaFolds2/ChunkSplitters.jl/tree/UnitRange_for_batch
This version solves the performance issue reported here, because it returns a `UnitRange{Int}` if `split=:batch` and a `StepRange{Int,Int}` if `split=:scatter`.
Now I have an outer user-facing function, which converts the `split::Symbol` parameter into a type parameter and dispatches on it all the way down.
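Schematically, the conversion looks something like this (a sketch of the idea with hypothetical names, not the code in the branch):

```julia
# Encode the split choice in a type, so that each range-building method
# can have a single concrete return type.
struct BatchSplitter end
struct ScatterSplitter end

function splitter(split::Symbol)
    split === :batch   && return BatchSplitter()
    split === :scatter && return ScatterSplitter()
    throw(ArgumentError("split must be :batch or :scatter"))
end
```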
Nevertheless, this introduces an inference problem (a potential type instability, although only marked in "yellow" by `@code_warntype`) that manifests itself whenever we try to set the `split` parameter:
```julia
julia> using ChunkSplitters, Test

julia> @inferred chunks(1:7; n=4, split=:batch)
ERROR: return type ChunkSplitters.Chunk{UnitRange{Int64}, ChunkSplitters.BatchSplitter, ChunkSplitters.FixedCount} does not match inferred return type Union{ChunkSplitters.Chunk{UnitRange{Int64}, ChunkSplitters.BatchSplitter, ChunkSplitters.FixedCount}, ChunkSplitters.Chunk{UnitRange{Int64}, ChunkSplitters.ScatterSplitter, ChunkSplitters.FixedCount}}
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] top-level scope
   @ REPL[9]:1
```
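Note that the Union here is small, so at a call site the usual mitigation is a function barrier, which lets Julia's union splitting re-specialize downstream code on the concrete chunk type. A sketch of that pattern (with a hypothetical kernel, not code from the branch):

```julia
using ChunkSplitters

# `c` is only union-typed locally; dispatching into `_work`
# re-specializes the hot loop on the concrete type of `c`.
function work(a; split)
    c = chunks(a; n = 2, split)   # inferred as a small Union
    return _work(a, c)            # function barrier
end
_work(a, c) = sum(first(r) for r in c)   # hypothetical kernel over the chunks
```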
And this brings back this issue: https://github.com/JuliaFolds2/ChunkSplitters.jl/issues/10, which is where I decided to return the same type of range in all cases.
I don't know if there is a better solution. I'm more afraid of propagating a type instability to other people's code than of this specific performance issue. The `Base.@constprop :aggressive` option suggested in the other issue does not seem to solve the problem here. But since ChunkSplitters is used behind OhMyThreads, many people might experience this performance loss without noticing. In fact, I'm not sure whether the performance difference between the two range types in the `step == 1` case shouldn't be dealt with upstream.
@carstenbauer and @MasonProtter , do you want to chime in?
FWIW, the key here is that inference related to the `split` keyword parameter does not propagate through this function; further down the call chain, everything is fine.
This is the issue, in a much simpler case:
```julia
julia> function f(x; y::Symbol)
           if y == :one
               return 1
           elseif y == :two
               return 2.0
           else
               error()
           end
       end
f (generic function with 1 method)

julia> @inferred f(1; y=:two)
ERROR: return type Float64 does not match inferred return type Union{Float64, Int64}
```
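For comparison, encoding the choice in a type instead of a `Symbol` makes each call concretely inferred. A minimal sketch with hypothetical names:

```julia
using Test

struct One end
struct Two end

g(x, ::One) = 1      # each method now has a single concrete return type
g(x, ::Two) = 2.0

@inferred g(1, One())   # passes: the branch is selected by dispatch
@inferred g(1, Two())   # passes as well
```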
We could change the API to be:
```julia
julia> @inferred chunks(1:7, BatchSplitter; n=4)
ChunkSplitters.Chunk{UnitRange{Int64}, BatchSplitter, ChunkSplitters.FixedCount}(1:7, 4, 0)

julia> @inferred chunks(1:7, ScatterSplitter; n=4)
ChunkSplitters.Chunk{UnitRange{Int64}, ScatterSplitter, ChunkSplitters.FixedCount}(1:7, 4, 0)
```
That would just require removing the method that accepts `split` as a `Symbol`. Of course, this is all breaking. `BatchSplitter` would be the default (positional) parameter.
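In user code, that API would look something like this (a sketch based on the calls above; the threaded pattern mirrors the MWE earlier in the thread):

```julia
using ChunkSplitters

# Hypothetical usage of the type-based API from the branch:
function tsum(a; n = 2)
    partials = zeros(eltype(a), n)
    Threads.@threads for (i, range) in enumerate(chunks(a, BatchSplitter; n))
        partials[i] = sum(@view a[range])
    end
    return sum(partials)
end
```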
I think you know this, but I would note that `f() = chunks(1:7; n=4, split=:batch)` works okay:
```julia
julia> @inferred f()
ChunkSplitters.Chunk{UnitRange{Int64}, ChunkSplitters.BatchSplitter, ChunkSplitters.FixedCount}(1:7, 4, 0)
```
No, I was not particularly aware of that. You mean that perhaps the inference issue does not exist if the call is within a function? Can we be sure about that?
Yes, I think we can. It would not be inferred if we had `f(x) = chunks(1:7; n=4, split=x)`, though. Actually, it should be type unstable only if `x` is a runtime value inside another function or if `x` is a parameter of the outermost function.
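In other words, constant propagation resolves the `Symbol` when it is a literal in the caller, but not when it arrives as a runtime argument. A small sketch of both cases:

```julia
using ChunkSplitters, Test

f_const() = chunks(1:7; n=4, split=:batch)   # Symbol is a compile-time constant
f_arg(x)  = chunks(1:7; n=4, split=x)        # Symbol known only at runtime

@inferred f_const()        # passes: const-prop selects a single branch
# @inferred f_arg(:batch)  # would throw: the inferred return type is a Union
```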
I checked, and indeed the code given as an MWE in #10 is fine with your fix.
Oh, that's nice, I also checked now:
```julia
julia> function mwe(ichunk=2, n=5)
           return getchunk(1:10, ichunk; n, split=:batch)
       end
mwe (generic function with 4 methods)

julia> @code_warntype mwe()
MethodInstance for mwe()
  from mwe() @ Main REPL[26]:1
Arguments
  #self#::Core.Const(mwe)
Body::UnitRange{Int64}
1 ─ %1 = (#self#)(2, 5)::Core.Const(3:4)
└──      return %1

julia> function mwe(ichunk=2, n=5)
           return getchunk(1:10, ichunk; n, split=:scatter)
       end
mwe (generic function with 4 methods)

julia> @code_warntype mwe()
MethodInstance for mwe()
  from mwe() @ Main REPL[28]:1
Arguments
  #self#::Core.Const(mwe)
Body::StepRange{Int64, Int64}
1 ─ %1 = (#self#)(2, 5)::Core.Const(2:5:7)
└──      return %1
```
FWIW, I think that the failure in the inference of the `Zip` object is not really related to the chunking returning unions. This is in the current version, which always returns `StepRange`s:
```julia
(jl_7XlS0M) pkg> st
Status `/tmp/jl_7XlS0M/Project.toml`
  [ae650224] ChunkSplitters v2.4.2

julia> function mwe2(ichunk=2, n=5, l=10)
           cx = getchunk(collect(1:l), ichunk; n, split=:batch)
           cy = getchunk(collect(1:l), ichunk; n, split=:batch)
           return Iterators.zip(cx, cy)
       end
mwe2 (generic function with 4 methods)

julia> @code_warntype mwe2()
MethodInstance for mwe2()
  from mwe2() @ Main REPL[12]:1
Arguments
  #self#::Core.Const(mwe2)
Body::Base.Iterators.Zip
1 ─ %1 = (#self#)(2, 5, 10)::Base.Iterators.Zip
└──      return %1
```
and it is the same with the call in the completely type-stable new version:
```julia
julia> function mwe2(ichunk=2, n=5, l=10)
           cx = getchunk(collect(1:l), ichunk, BatchSplitter; n)
           cy = getchunk(collect(1:l), ichunk, BatchSplitter; n)
           return Iterators.zip(cx, cy)
       end
mwe2 (generic function with 4 methods)

julia> @code_warntype mwe2()
MethodInstance for mwe2()
  from mwe2() @ Main REPL[51]:1
Arguments
  #self#::Core.Const(mwe2)
Body::Base.Iterators.Zip
1 ─ %1 = (#self#)(2, 5, 10)::Base.Iterators.Zip
└──      return %1
```
The problem there is that `getchunk` can return `nothing` if the array has zero length. But that's a separate issue: https://github.com/JuliaFolds2/ChunkSplitters.jl/issues/36
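The same effect can be reproduced in isolation: when zip's inputs may be `nothing`, the concrete `Iterators.Zip` type parameters cannot be inferred. A standalone sketch (my addition, not from the thread):

```julia
# `h` can return `nothing`, so its inferred type is
# Union{Nothing, UnitRange{Int}}; zipping two such values
# leaves the Zip type parameters abstract.
h(l) = l > 0 ? (1:l) : nothing
mwe3(l) = Iterators.zip(h(l), h(l))

using InteractiveUtils
@code_warntype mwe3(10)   # Body::Base.Iterators.Zip, just like above
```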
So, the possible options are:
I think 3. would be really cool if a `chunk_indices` function is also provided. But I think 1. should be independent of it; 2. seems not particularly interesting, in my opinion.
I agree with @Tortar: I'm also in favour of 3. However, we should finish the (naming) discussion in #25 so that the new API is future-proof.
Ok, fixed with https://github.com/JuliaFolds2/ChunkSplitters.jl/pull/37.
Let's move to the next one :-)
Fix released in v2.4.3.