For the record, I obtain the same run-times:
```julia
julia> @btime arr_sum($X);
  6.528 μs (0 allocations: 0 bytes)

julia> @btime arr_sum($XO);
  6.535 μs (0 allocations: 0 bytes)
```
This is odd, as indexing is definitely faster with `Array`s:
```julia
julia> @btime getindex($X, 1, 1, 1, 1, 1, 1);
  3.768 ns (0 allocations: 0 bytes)

julia> @btime getindex($XO, 3, 2, 1, 2, 3, 4);
  14.857 ns (0 allocations: 0 bytes)

julia> getindex_inbounds(X, inds...) = @inbounds X[inds...];

julia> @btime getindex_inbounds($X, 1, 1, 1, 1, 1, 1);
  2.694 ns (0 allocations: 0 bytes)

julia> @btime getindex_inbounds($XO, 3, 2, 1, 2, 3, 4);
  4.960 ns (0 allocations: 0 bytes)
```
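For reference, the benchmarks above appear to assume a setup roughly like the following. The sizes, offsets, and the exact `arr_sum` body are my guesses from context rather than the issue's original code (a 4×4×4×4×4×4 array is consistent with the ~2070 sums reported later in the thread):

```julia
using OffsetArrays, BenchmarkTools

# Hypothetical setup: a 6-dimensional Array and an OffsetArray wrapper.
# Zero offsets keep the indices used in the benchmarks valid; the real
# offsets in the original issue may differ.
X  = rand(4, 4, 4, 4, 4, 4)
XO = OffsetArray(X, 0, 0, 0, 0, 0, 0)

# Sum all elements by explicit indexing, with bounds checks elided.
function arr_sum(A)
    s = zero(eltype(A))
    @inbounds for I in CartesianIndices(A)
        s += A[I]
    end
    return s
end
```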
Sorry, I wasn't aware there's a performance regression in Julia 1.6-dev; see also https://github.com/JuliaLang/julia/issues/38073.
Is the "raw" indexing time relevant? Any performance-sensitive operation is presumably in a loop, and in a loop won't the compiler hoist all the slow parts out of the loop?
As evidence, if I define `arr_sum_bc` identically to `arr_sum` but without the `@inbounds`, here's what I get:
```julia
julia> @btime arr_sum($X)
  5.992 μs (0 allocations: 0 bytes)
2069.986834289989

julia> @btime arr_sum($XO)
  4.093 μs (0 allocations: 0 bytes)
2069.986834289989

julia> @btime arr_sum_bc($X)
  7.791 μs (0 allocations: 0 bytes)
2069.986834289989

julia> @btime arr_sum_bc($XO)
  5.022 μs (0 allocations: 0 bytes)
2069.986834289989
```
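For concreteness, `arr_sum_bc` as described above is just the same loop with the bounds checks left in; this is my reconstruction matching that description, not the thread's exact code:

```julia
# Identical to arr_sum, but without @inbounds, so bounds checks remain.
function arr_sum_bc(A)
    s = zero(eltype(A))
    for I in CartesianIndices(A)
        s += A[I]
    end
    return s
end
```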
Despite the fact that an isolated `getindex` is slower, in the context of a loop the slow parts are done just once, so the performance does not suffer.
Bad inlining could nix this, but since we've annotated with `@inline`, I doubt that will be a problem for most uses.
I don't understand why OffsetArrays are faster than Arrays, though.
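To make the hoisting argument concrete, here is a hand-written version of what the compiler can effectively produce for an `OffsetArray` loop once `getindex` is inlined. The function is mine and purely illustrative, not OffsetArrays internals:

```julia
using OffsetArrays

# Fetch the parent once, then index it directly: the bounds checks and
# offset bookkeeping from getindex(::OffsetArray, ...) happen at most
# once outside the loop instead of on every iteration.
function arr_sum_hoisted(A::OffsetArray)
    P = parent(A)
    s = zero(eltype(P))
    @inbounds for I in CartesianIndices(P)
        s += P[I]
    end
    return s
end
```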
I've recently been building up some cache arrays with OffsetArrays and realized that the performance bottleneck becomes `getindex(::OffsetArray, I)`.

The benchmark result looks interesting; I'm unsure why `arr_sum` runs faster on `OffsetArray` 🤔 Any ideas?

The default `checkbounds` implementation definitely takes too long here. I believe the additional time is spent on the construction of `IdOffsetRange` and its generic, and thus slower, `getindex`. These might be benchmark artifacts, though.
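One way to test that suspicion is to time the suspected pieces in isolation. This is only a sketch: `XO` is the array from the benchmarks above, and no particular timings are implied here:

```julia
using OffsetArrays, BenchmarkTools

# Cost of the bounds check alone, using the same indices as the
# getindex benchmark above.
@btime checkbounds($XO, 3, 2, 1, 2, 3, 4)

# Cost of constructing the IdOffsetRange axes, and of their generic getindex.
@btime axes($XO)
ax = axes(XO, 1)                # an IdOffsetRange
i  = first(eachindex(ax))
@btime getindex($ax, $i)
```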