Operations on `views` much slower when indices are `start:1:stop`

Tortar commented 3 weeks ago

MWE:

julia> using StreamSampling

julia> a = collect(1:10^7);

julia> view_a = @view(a[1:length(a)]);

julia> view_step_a = @view(a[1:1:length(a)]);

julia> using BenchmarkTools

julia> @benchmark itsample($a, 1)
BenchmarkTools.Trial: 4235 samples with 1 evaluation.
 Range (min … max):  1.252 μs …   3.169 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.168 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.177 ms ± 684.714 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▅▅▅▇▄▆▆▅▆▆▆▅▅▇▅▄▅▇▇▅█▄▇▃▇▅▇▅▃█▄▇▄▇▄▇▂▅▅▆▇▆▅▇▆▃▆▅▃▅▅▆▃▅▆▅▄▅  
  ███████████████████████████████████████████████████████████ █
  1.25 μs         Histogram: frequency by time        2.36 ms <

 Memory estimate: 144 bytes, allocs estimate: 3.

julia> @benchmark itsample($view_a, 1)
BenchmarkTools.Trial: 2856 samples with 1 evaluation.
 Range (min … max):  2.766 μs … 3.574 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.739 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.750 ms ± 1.033 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▄▄▅▇▅▆▄▃▆▄▅▆▂█▄▅▆▄ ▅▆▅▅▄▆▄▅▅▆▄▆▄▅▆▃▃▅▄▂▄▂▃▅▂▄▄▅▄▅▄▆▅▂█▅  
  █████████████████████████████████████████████████████████ █
  2.77 μs        Histogram: frequency by time       3.51 ms <

 Memory estimate: 144 bytes, allocs estimate: 3.

julia> @benchmark itsample($view_step_a, 1)
BenchmarkTools.Trial: 411 samples with 1 evaluation.
 Range (min … max):  57.179 μs … 24.163 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.177 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.215 ms ±  7.123 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂   ▁▁ ▁   █   ▃▂▂      ▅▁ ▁     ▁▁      ▁ ▁  ▂ ▁▂ ▁▄▁▁     
  ▇███▇██▃█▅▅▄███▇████▄▅▄▆▆██▄█▆█▆█▅███▇▆▇▆██▄█▆█████▄████▆██ ▅
  57.2 μs         Histogram: frequency by time        23.9 ms <

 Memory estimate: 144 bytes, allocs estimate: 3.

This depends on package code, but internally nothing changes if one passes a view or a vector. Indeed when the step in the view is not provided the slowdown is much less pronounced and maybe expected. But it seems strange to me that instead a view with a provided step gives a 10x slowdown.

oscardssmith commented 3 weeks ago

I don't think this is especially surprising. Memory locality matters a lot for performance.

Tortar commented 3 weeks ago

But can at least x:1:y be as fast as x:y? I actually found that some threaded code was slower than single threaded code because ChunkSplitters.jl produced the range with the step. But probably this can be improved also from the ChunkSplitters.jl side

Tortar commented 3 weeks ago

I think it is better there

Zentrik commented 3 weeks ago

Looks like the majority of the time is spent in https://github.com/JuliaDynamics/StreamSampling.jl/blob/1d56ec85c2842bba1f800feb48ab978d6db8eb6c/src/SortedSamplingMulti.jl#L46 called from https://github.com/JuliaDynamics/StreamSampling.jl/blob/1d56ec85c2842bba1f800feb48ab978d6db8eb6c/src/SortedSamplingMulti.jl#L23.

Using the step range seems to be slower due to checkbounds being much slower due to length being slow, https://github.com/JuliaLang/julia/blob/48d4fd48430af58502699fdf3504b90589df3852/base/range.jl#L782 The performance difference is also present on nightly.

Adding a fast path for a step range with a step of 1 in length triples the speed for me but it's still not as fast as a UnitRange (10ms vs 2ms). Not sure much else can be done on the Julia side. I suspect the code in StreamSampling could be made faster to avoid this issue.

EDIT: If length could inline (I don't think adding @inline would work as then wherever length is called would just fail to inline) I think we would see more comparable performance. Right now it doesn't because the inliner counts the cost of the div in each of the branches separately. That seems odd to me but I presume it's intentional.

nsajko commented 3 weeks ago

Perhaps deduplicating the div from the last two branches could help.

https://github.com/JuliaLang/julia/blob/48d4fd48430af58502699fdf3504b90589df3852/base/range.jl#L789-L795

Something like

D = typeof(diff)
a = if s isa Unsigned || -1 <= s <= 1 || s == -s
    div(diff, s) % D
else
    let b = (s < 0) ? (unsigned(-diff), -s) : (unsigned(diff), s)
        div(b...) % D
    end
end

jishnub commented 3 weeks ago

Does the performance improve if the step size is chosen to be static(1) using Static.jl?

Zentrik commented 3 weeks ago

Using static(1) for the step helps a lot, now I get 3.5ms compared to 2ms for view_a.

Changing the inliner to only count the most expensive branch didn't seem to make any difference. EDIT: Changing the inliner (https://github.com/Zentrik/julia/commit/ce3e033f62099adeb2d0b784154b610fca215df0) and adding a fast path to length completely fixes the problem.

JuliaLang / julia

Operations on `views` much slower when indices are `start:1:stop` #54818