johnnychen94 opened 3 years ago
This is a benchmark of summing arrays of 4096 elements at various dimensionalities.
OffsetArrays 1.3.1, which uses IdOffsetRange as its axes type, is included
as a comparison. It is very interesting that OffsetArray
on Julia 1.6.0-DEV gets relatively the best performance.
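For reference, here is a minimal sketch of the kind of comparison being benchmarked. The exact benchmark script isn't shown here, so the loop body and variable names are assumptions; it requires the OffsetArrays.jl package.

```julia
using OffsetArrays

# A plain Array and an OffsetArray with identical contents
# (4096 elements, here as a 6-dimensional array).
A = rand(4, 4, 4, 4, 4, 4)
OA = OffsetArray(A, -1, -1, -1, -1, -1, -1)  # shift every axis to start at 0

# The Cartesian-indexed sum pattern used in the timings below.
function arr_sum(X)
    val = zero(eltype(X))
    for i in CartesianIndices(X)
        val += X[i]
    end
    val
end

arr_sum(A) ≈ arr_sum(OA)  # same result regardless of axis offsets
```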
Interesting. So basically everything in 1.6 is no worse than 1.5, with the minor exception of Array in 6 and 7 dimensions?
I read https://github.com/JuliaLang/julia/issues/38086#issuecomment-715495220 and wondered if there was something going on with the bounds check (or inbounds propagation).
```julia
function arr_sum_both(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    @inbounds for i in R
        @inbounds val += X[i]
    end
    val
end

function arr_sum_outeronly(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    @inbounds for i in R
        val += X[i]
    end
    val
end
```
```julia
julia> VERSION
v"1.6.0-DEV.1322"

julia> @btime arr_sum($X);
  5.033 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_both($X);
  5.033 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_outeronly($X);
  5.267 μs (0 allocations: 0 bytes)
```
This may be a separate issue, but it is weird that arr_sum_outeronly is slower.
This is what I get now with two repeated benchmarks:
```julia
julia> VERSION
v"1.7.0-DEV.36"

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum($X);
  5.540 μs (0 allocations: 0 bytes)
  5.538 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_both($X);
  5.221 μs (0 allocations: 0 bytes)
  5.582 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_outeronly($X);
  5.110 μs (0 allocations: 0 bytes)
  5.223 μs (0 allocations: 0 bytes)
```
This difference might just be noise.
I also checked again and found no difference in the native code depending on the position of @inbounds.
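For anyone wanting to reproduce that check, one typical way is to dump the native code for both placements and diff them by eye. This is a sketch, not necessarily the exact commands used; the definitions are repeated so the snippet is self-contained.

```julia
function arr_sum_both(X)
    val = zero(eltype(X))
    @inbounds for i in CartesianIndices(X)
        @inbounds val += X[i]
    end
    val
end

function arr_sum_outeronly(X)
    val = zero(eltype(X))
    @inbounds for i in CartesianIndices(X)
        val += X[i]
    end
    val
end

X = rand(4, 4, 4, 4, 4, 4)
@code_native arr_sum_both(X)       # dump native code for manual comparison
@code_native arr_sum_outeronly(X)
```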
However, the noise does not appear to be random. Internal CPU state, such as the caches, may be affecting it in this case.
On the other hand, in the case of https://github.com/JuliaLang/julia/issues/38086#issuecomment-715495220, there is a clear difference.
Seems like this caused a similar problem in https://discourse.julialang.org/t/drop-of-performances-with-julia-1-6-0-for-interpolationkernels/58085 as was fixed in https://github.com/JuliaLang/julia/pull/39333.
https://discourse.julialang.org/t/drop-of-performances-with-julia-1-6-0-for-interpolationkernels/58085/12 has an MWE.
Adding @inbounds @simd outside the Cartesian loop seems to work around it.
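Concretely, the workaround looks like this (the function name is illustrative):

```julia
# Sketch of the workaround: annotate the Cartesian loop itself
# with @inbounds @simd instead of annotating the loop body.
function arr_sum_workaround(X)
    val = zero(eltype(X))
    @inbounds @simd for i in CartesianIndices(X)
        val += X[i]
    end
    val
end

X = rand(4, 4, 4, 4, 4, 4)
arr_sum_workaround(X) ≈ sum(X)  # true
```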
IIRC, this might have something to do with https://github.com/JuliaLang/julia/issues/39700#issuecomment-781030776
However, I don't know what the countermeasure on the compiler side would be.
It turns out that #37829 improved iteration performance for 2-dimensional arrays while slowing down iteration for higher-dimensional (>= 4) arrays...
SIMD and LinearIndices are not affected.
simd

```julia
julia> using BenchmarkTools

julia> function arr_sum_simd(X)
           val = zero(eltype(X))
           R = CartesianIndices(X)
           @simd for i in R
               @inbounds val += X[i]
           end
           val
       end
arr_sum_simd (generic function with 1 method)

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum_simd($X)
  3.593 μs (0 allocations: 0 bytes) # 1.6.0-DEV.1262
  3.827 μs (0 allocations: 0 bytes) # 1.5.2
  3.585 μs (0 allocations: 0 bytes) # 1.0.5
```

LinearIndices

```julia
julia> using BenchmarkTools

julia> function arr_sum_linear(X)
           val = zero(eltype(X))
           R = LinearIndices(X)
           for i in R
               @inbounds val += X[i]
           end
           val
       end
arr_sum_linear (generic function with 1 method)

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum_linear($X)
  3.707 μs (0 allocations: 0 bytes) # 1.6.0-DEV.1262
  3.626 μs (0 allocations: 0 bytes) # 1.5.2
  3.796 μs (0 allocations: 0 bytes) # 1.0.5
```