Closed: Tortar closed this 4 months ago.

I initially described the issue here: https://github.com/JuliaLang/julia/issues/54818, but I actually found this bottleneck in ChunkSplitters.jl when operating in multi-threading on views. I think it would be much faster in some cases if it returned a `UnitRange{Int}` instead of a `StepRange{Int, Int}` when the step is 1.
The reason for returning the range with the step is that the step might differ from 1 if the scatter option is chosen. And if we returned different range types in each case, we would introduce a type instability. We could address that by dispatching on chunk types. But do you have an example where that clearly affects performance? (I'm not sure, but the necessary changes might be breaking.)
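For context, this is the shape of the two chunkings of `1:10` with `n = 2` (illustrative values, not library output):

```julia
# :batch   -> 1:5 and 6:10        (contiguous; expressible as a UnitRange)
# :scatter -> 1:2:9 and 2:2:10    (interleaved; requires a StepRange)
```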
Here it is, starting Julia with `julia --threads=2`:
```julia
using BenchmarkTools
using ChunkSplitters
using StreamSampling

a = collect(1:10^7)
c1 = chunks(a; n = 2)                          # chunk ranges from ChunkSplitters
c2 = [(1, 1:5000000), (2, 5000001:10000000)]   # hand-written UnitRange chunks

function mwe_c1(a, c)
    s = Vector{Vector{Int}}(undef, 2)
    Threads.@threads for (i, v) in enumerate(c)
        s[i] = itsample(@view(a[v]), 1)   # draw one random sample per chunk
    end
    return s
end

function mwe_c2(a, c)
    s = Vector{Vector{Int}}(undef, 2)
    Threads.@threads for (i, v) in c
        s[i] = itsample(@view(a[v]), 1)
    end
    return s
end

@benchmark mwe_c1($a, $c1)
@benchmark mwe_c2($a, $c2)
```
I see:
```julia
julia> @benchmark mwe_c1($a, $c1)
BenchmarkTools.Trial: 591 samples with 1 evaluation.
 Range (min … max):  362.941 μs … 12.869 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       8.862 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     8.456 ms ±  2.854 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▄ ▁▂ ▁▂▄▁▂▂ ▁▄ ▆ █▄▇▂▂▄
  ▃▃▃▃▃▃▁▄▅▁▃▃▅▅▄▅▇▆▆▅▅▅▆▄█▅▄▆████▆██▆█████████████▆█████████▇ ▅
  363 μs          Histogram: frequency by time         12.5 ms <

 Memory estimate: 1.47 KiB, allocs estimate: 18.

julia> @benchmark mwe_c2($a, $c2)
BenchmarkTools.Trial: 4051 samples with 1 evaluation.
 Range (min … max):  46.218 μs …   2.398 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      1.306 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):    1.231 ms ± 431.481 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁   ▂▁▁▂▂▂▃▃▄▁▄▃▃▄▆▄▇▅▄▆▇█▇▇▇██▇▆▂
  ▂▁▂▂▂▃▃▄▄▄▄▄▅▅▄▅▆▆▅▇▇█▇▇▇▇████████████████████████████████▆▃ ▆
  46.2 μs         Histogram: frequency by time         1.89 ms <

 Memory estimate: 1.44 KiB, allocs estimate: 18.
```
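The gap is consistent with the cost of indexing a view through a `StepRange`: even when the step is 1, the stride is a runtime value, which blocks contiguous-memory optimizations. A minimal standalone sketch to check this in isolation (my addition, not part of the report above):

```julia
using BenchmarkTools

a = collect(1:10^7)

# View through a UnitRange index: known contiguous, reductions can vectorize.
@benchmark sum(view($a, 1:10^7))

# View through a StepRange index with step 1: same elements, but the stride
# is only known at runtime, so the same optimizations do not apply.
@benchmark sum(view($a, 1:1:10^7))
```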
I made an attempt to see if union splitting could solve this issue without introducing instabilities. The attempt is in this branch:
https://github.com/JuliaFolds2/ChunkSplitters.jl/tree/UnitRange_for_batch
This version solves the performance issue reported here, because it returns a `UnitRange{Int}` if `split=:batch` and a `StepRange{Int,Int}` if `split=:scatter`.
Now I have an outer user-facing function, which converts the `split::Symbol` parameter into a type parameter and dispatches on it all the way down.
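Schematically, the conversion looks something like this (a sketch of the idea with hypothetical names, not the code in the branch):

```julia
# Encode the split choice in a type, so that each range-building method
# can have a single concrete return type.
struct BatchSplitter end
struct ScatterSplitter end

function splitter(split::Symbol)
    split === :batch   && return BatchSplitter()
    split === :scatter && return ScatterSplitter()
    throw(ArgumentError("split must be :batch or :scatter"))
end
```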
Nevertheless, this introduces an inference problem (a potential type instability, although only marked in "yellow" by `@code_warntype`) that manifests itself whenever we try to set the `split` parameter:
```julia
julia> using ChunkSplitters, Test

julia> @inferred chunks(1:7; n=4, split=:batch)
ERROR: return type ChunkSplitters.Chunk{UnitRange{Int64}, ChunkSplitters.BatchSplitter, ChunkSplitters.FixedCount} does not match inferred return type Union{ChunkSplitters.Chunk{UnitRange{Int64}, ChunkSplitters.BatchSplitter, ChunkSplitters.FixedCount}, ChunkSplitters.Chunk{UnitRange{Int64}, ChunkSplitters.ScatterSplitter, ChunkSplitters.FixedCount}}
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] top-level scope
   @ REPL[9]:1
```
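Note that the Union here is small, so at a call site the usual mitigation is a function barrier, which lets Julia's union splitting re-specialize downstream code on the concrete chunk type. A sketch of that pattern (with a hypothetical kernel, not code from the branch):

```julia
using ChunkSplitters

# `c` is only union-typed locally; dispatching into `_work`
# re-specializes the hot loop on the concrete type of `c`.
function work(a; split)
    c = chunks(a; n = 2, split)   # inferred as a small Union
    return _work(a, c)            # function barrier
end
_work(a, c) = sum(first(r) for r in c)   # hypothetical kernel over the chunks
```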
And this brings back this issue: https://github.com/JuliaFolds2/ChunkSplitters.jl/issues/10, which is where I decided to return the same type of range in all cases.
I don't know if there is a better solution. I'm more afraid of propagating a type instability to other people's code than of this specific performance issue. The `Base.@constprop :aggressive` option suggested in the other issue does not seem to solve the problem here. But since ChunkSplitters is used behind OhMyThreads, many people might experience this performance loss without noticing. In fact, I'm not sure whether the performance difference between the two range types in the `step == 1` case shouldn't be dealt with upstream.
@carstenbauer and @MasonProtter , do you want to chime in?
FWIW, the key here is that inference related to the `split` keyword parameter does not propagate through this function; further down the call chain, everything is fine.
This is the issue, in a much simpler case:
```julia
julia> function f(x; y::Symbol)
           if y == :one
               return 1
           elseif y == :two
               return 2.0
           else
               error()
           end
       end
f (generic function with 1 method)

julia> @inferred f(1; y=:two)
ERROR: return type Float64 does not match inferred return type Union{Float64, Int64}
```
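For comparison, encoding the choice in a type instead of a `Symbol` makes each call concretely inferred. A minimal sketch with hypothetical names:

```julia
using Test

struct One end
struct Two end

g(x, ::One) = 1      # each method now has a single concrete return type
g(x, ::Two) = 2.0

@inferred g(1, One())   # passes: the branch is selected by dispatch
@inferred g(1, Two())   # passes as well
```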
We could change the API to be:
```julia
julia> @inferred chunks(1:7, BatchSplitter; n=4)
ChunkSplitters.Chunk{UnitRange{Int64}, BatchSplitter, ChunkSplitters.FixedCount}(1:7, 4, 0)

julia> @inferred chunks(1:7, ScatterSplitter; n=4)
ChunkSplitters.Chunk{UnitRange{Int64}, ScatterSplitter, ChunkSplitters.FixedCount}(1:7, 4, 0)
```
That would just require removing the method that accepts `split` as a `Symbol`. Of course, this is all breaking. `BatchSplitter` would be the default (positional) parameter.
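In user code, that API would look something like this (a sketch based on the calls above; the threaded pattern mirrors the MWE earlier in the thread):

```julia
using ChunkSplitters

# Hypothetical usage of the type-based API from the branch:
function tsum(a; n = 2)
    partials = zeros(eltype(a), n)
    Threads.@threads for (i, range) in enumerate(chunks(a, BatchSplitter; n))
        partials[i] = sum(@view a[range])
    end
    return sum(partials)
end
```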
I think you know this, but I would note that `f() = chunks(1:7; n=4, split=:batch)` works okay:
```julia
julia> @inferred f()
ChunkSplitters.Chunk{UnitRange{Int64}, ChunkSplitters.BatchSplitter, ChunkSplitters.FixedCount}(1:7, 4, 0)
```
No, I was not particularly aware of that. You mean that perhaps the inference issue does not exist if the call is within a function? Can we be sure about that?
Yes, I think we can. It would not be inferred if we had `f(x) = chunks(1:7; n=4, split=x)`, though. Actually, it should be type unstable only if `x` is a runtime value inside another function or if `x` is a parameter of the outermost function.
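In other words, constant propagation resolves the `Symbol` when it is a literal in the caller, but not when it arrives as a runtime argument. A small sketch of both cases:

```julia
using ChunkSplitters, Test

f_const() = chunks(1:7; n=4, split=:batch)   # Symbol is a compile-time constant
f_arg(x)  = chunks(1:7; n=4, split=x)        # Symbol known only at runtime

@inferred f_const()        # passes: const-prop selects a single branch
# @inferred f_arg(:batch)  # would throw: the inferred return type is a Union
```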
I checked, and indeed the code given as an MWE in #10 is fine with your fix.
Oh, that's nice, I also checked now:
```julia
julia> function mwe(ichunk=2, n=5)
           return getchunk(1:10, ichunk; n, split=:batch)
       end
mwe (generic function with 4 methods)

julia> @code_warntype mwe()
MethodInstance for mwe()
  from mwe() @ Main REPL[26]:1
Arguments
  #self#::Core.Const(mwe)
Body::UnitRange{Int64}
1 ─ %1 = (#self#)(2, 5)::Core.Const(3:4)
└──      return %1

julia> function mwe(ichunk=2, n=5)
           return getchunk(1:10, ichunk; n, split=:scatter)
       end
mwe (generic function with 4 methods)

julia> @code_warntype mwe()
MethodInstance for mwe()
  from mwe() @ Main REPL[28]:1
Arguments
  #self#::Core.Const(mwe)
Body::StepRange{Int64, Int64}
1 ─ %1 = (#self#)(2, 5)::Core.Const(2:5:7)
└──      return %1
```
FWIW, I think that the failure in the inference of the `Zip` object is not really related to the chunking returning unions. This is in the current version, which always returns `StepRange`s:
```julia
(jl_7XlS0M) pkg> st
Status `/tmp/jl_7XlS0M/Project.toml`
  [ae650224] ChunkSplitters v2.4.2

julia> function mwe2(ichunk=2, n=5, l=10)
           cx = getchunk(collect(1:l), ichunk; n, split=:batch)
           cy = getchunk(collect(1:l), ichunk; n, split=:batch)
           return Iterators.zip(cx, cy)
       end
mwe2 (generic function with 4 methods)

julia> @code_warntype mwe2()
MethodInstance for mwe2()
  from mwe2() @ Main REPL[12]:1
Arguments
  #self#::Core.Const(mwe2)
Body::Base.Iterators.Zip
1 ─ %1 = (#self#)(2, 5, 10)::Base.Iterators.Zip
└──      return %1
```
and it is the same with the call in the completely type-stable new version:
```julia
julia> function mwe2(ichunk=2, n=5, l=10)
           cx = getchunk(collect(1:l), ichunk, BatchSplitter; n)
           cy = getchunk(collect(1:l), ichunk, BatchSplitter; n)
           return Iterators.zip(cx, cy)
       end
mwe2 (generic function with 4 methods)

julia> @code_warntype mwe2()
MethodInstance for mwe2()
  from mwe2() @ Main REPL[51]:1
Arguments
  #self#::Core.Const(mwe2)
Body::Base.Iterators.Zip
1 ─ %1 = (#self#)(2, 5, 10)::Base.Iterators.Zip
└──      return %1
```
The problem there is that `getchunk` can return `nothing` if the array has zero length. But that's a separate issue: https://github.com/JuliaFolds2/ChunkSplitters.jl/issues/36
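The same effect can be reproduced in isolation: when zip's inputs may be `nothing`, the concrete `Iterators.Zip` type parameters cannot be inferred. A standalone sketch (my addition, not from the thread):

```julia
# `h` can return `nothing`, so its inferred type is
# Union{Nothing, UnitRange{Int}}; zipping two such values
# leaves the Zip type parameters abstract.
h(l) = l > 0 ? (1:l) : nothing
mwe3(l) = Iterators.zip(h(l), h(l))

using InteractiveUtils
@code_warntype mwe3(10)   # Body::Base.Iterators.Zip, just like above
```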
So, the possible options are:
I think 3. would be really cool if a `chunk_indices` function is also provided. But I think 1. should be independent of it; 2. seems not particularly interesting, in my opinion.
I agree with @Tortar: I'm also in favour of 3. However, we should finish the (naming) discussion in #25 so that the new API is future-proof.
Ok, fixed with https://github.com/JuliaFolds2/ChunkSplitters.jl/pull/37.
Let's move to the next one :-)
Fix released in v2.4.3.