Tortar opened this issue 3 weeks ago
I don't think this is especially surprising. Memory locality matters a lot for performance.
But can at least `x:1:y` be as fast as `x:y`? I actually found that some threaded code was slower than single-threaded code because ChunkSplitters.jl produced the range with the step. But probably this can also be improved on the ChunkSplitters.jl side; I think it is better there.
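For context, `x:y` and `x:1:y` produce different range types even though they describe the same indices, so views indexed by them dispatch to different code. A small illustrative check:

```julia
# x:y is a UnitRange (step implicitly 1); x:1:y is a StepRange that
# carries an explicit step field, so length/checkbounds take a
# different, slower path
r_unit = 1:10
r_step = 1:1:10
@assert r_unit isa UnitRange{Int}
@assert r_step isa StepRange{Int,Int}
@assert collect(r_unit) == collect(r_step)   # same elements either way
```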
Looks like the majority of the time is spent in https://github.com/JuliaDynamics/StreamSampling.jl/blob/1d56ec85c2842bba1f800feb48ab978d6db8eb6c/src/SortedSamplingMulti.jl#L46 called from https://github.com/JuliaDynamics/StreamSampling.jl/blob/1d56ec85c2842bba1f800feb48ab978d6db8eb6c/src/SortedSamplingMulti.jl#L23.
Using the step range seems to be slower because `checkbounds` is much slower, which in turn is due to `length` being slow: https://github.com/JuliaLang/julia/blob/48d4fd48430af58502699fdf3504b90589df3852/base/range.jl#L782
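For reference, a condensed sketch of what `length` has to compute for a step range (an assumption: this simplifies Base's actual definition, which also guards against overflow via unsigned division, but it shows where the cost comes from):

```julia
# Simplified sketch of StepRange length; the div is what makes this
# slower than UnitRange's simple stop - start + 1
function steprange_len(start::Int, step::Int, stop::Int)
    diff = stop - start
    return div(diff, step) + 1
end

@assert steprange_len(1, 3, 10) == length(1:3:10) == 4
@assert steprange_len(1, 1, 10) == length(1:10) == 10
```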
The performance difference is also present on nightly.
Adding a fast path in `length` for a step range with a step of 1 triples the speed for me, but it's still not as fast as a `UnitRange` (10ms vs 2ms). I'm not sure much else can be done on the Julia side; I suspect the code in StreamSampling could be made faster to avoid this issue.
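A hedged sketch of that fast-path idea (function name and structure are illustrative, not the actual patch): short-circuit `step == 1` so the common case skips the `div` entirely.

```julia
function steprange_len_fast(start::Int, step::Int, stop::Int)
    diff = stop - start
    step == 1 && return diff + 1   # fast path: behaves like UnitRange
    return div(diff, step) + 1     # general (slower) path with the div
end

@assert steprange_len_fast(1, 1, 10) == length(1:1:10) == 10
@assert steprange_len_fast(1, 3, 10) == length(1:3:10) == 4
```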
EDIT: If `length` could inline (I don't think adding `@inline` would work, as then whatever calls `length` would itself fail to inline), I think we would see more comparable performance. Right now it doesn't inline because the inliner counts the cost of the `div` in each of the branches separately. That seems odd to me, but I presume it's intentional. Perhaps deduplicating the `div` from the last two branches could help.
Something like:

```julia
D = typeof(diff)
a = if s isa Unsigned || -1 <= s <= 1 || s == -s
    div(diff, s) % D
else
    let b = (s < 0) ? (unsigned(-diff), -s) : (unsigned(diff), s)
        div(b...) % D
    end
end
```
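A quick sanity check that the deduplicated form computes the same quotient as a plain signed `div` (the wrapper name is hypothetical and restricted to `Int` for simplicity, so the `Unsigned` test from the snippet above is dropped; the sign handling mirrors it otherwise):

```julia
# Hypothetical wrapper around the deduplicated-div sketch above
function dedup_div(diff::Int, s::Int)
    D = typeof(diff)
    if -1 <= s <= 1 || s == -s     # tiny steps and typemin stay on the signed path
        div(diff, s) % D
    else
        b = (s < 0) ? (unsigned(-diff), -s) : (unsigned(diff), s)
        div(b...) % D              # single shared div for the remaining branches
    end
end

@assert dedup_div(9, 3) == div(9, 3) == 3
@assert dedup_div(-9, -3) == div(-9, -3) == 3
@assert dedup_div(10, 1) == 10
```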
Does the performance improve if the step size is chosen to be `static(1)` using Static.jl?
Using `static(1)` for the step helps a lot; now I get 3.5ms compared to 2ms for `view_a`.
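The reason this helps is that `static(1)` makes the step part of the type, so the compiler can fold the division in `length` away. A minimal stand-in illustrating the mechanism (`StaticOne` below is a hypothetical mock, not Static.jl's actual API):

```julia
# A singleton type standing in for a compile-time step of 1; dividing
# by it is defined as the identity, so the div disappears entirely
struct StaticOne end
Base.div(x::Int, ::StaticOne) = x

steplen(diff, s) = div(diff, s) + 1

@assert steplen(9, StaticOne()) == steplen(9, 1) == 10
```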
Changing the inliner to only count the most expensive branch didn't seem to make any difference.
EDIT: Changing the inliner (https://github.com/Zentrik/julia/commit/ce3e033f62099adeb2d0b784154b610fca215df0) and adding a fast path to `length` completely fixes the problem.
MWE:
This depends on package code, but internally nothing changes whether one passes a view or a vector. Indeed, when the step in the view is not provided, the slowdown is much less pronounced and perhaps expected. But it seems strange to me that a view with an explicitly provided step gives a 10x slowdown.
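A minimal timing sketch of the kind of gap being reported, using only Base (the array size, loop body, and helper name are illustrative; actual ratios vary by machine and Julia version):

```julia
x = rand(10^7)
v_unit = view(x, 1:10^6)      # indices stored as a UnitRange
v_step = view(x, 1:1:10^6)    # indices stored as a StepRange with step 1

# naive bounds-checked sum, so checkbounds/length stay on the hot path
function checked_sum(v)
    s = 0.0
    for i in 1:length(v)
        s += v[i]
    end
    return s
end

checked_sum(v_unit); checked_sum(v_step)   # compile both specializations
t_unit = @elapsed checked_sum(v_unit)
t_step = @elapsed checked_sum(v_step)
@assert checked_sum(v_unit) == checked_sum(v_step)   # identical elements
println("step-range view took ", t_step / t_unit, "x the unit-range view time")
```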