oschulz opened 4 years ago (Open)
I was a little disappointed to still see that 7% difference there, but I was unable to reproduce it with my 6-core skylake laptop:
julia> @benchmark colnorms!($dest, $A)
BenchmarkTools.Trial:
memory estimate: 4.16 KiB
allocs estimate: 31
--------------
minimum time: 1.138 ms (0.00% GC)
median time: 1.191 ms (0.00% GC)
mean time: 1.213 ms (0.00% GC)
maximum time: 1.754 ms (0.00% GC)
--------------
samples: 4119
evals/sample: 1
julia> @benchmark @uviews $A colnorms!($dest, $A)
BenchmarkTools.Trial:
memory estimate: 4.16 KiB
allocs estimate: 31
--------------
minimum time: 1.136 ms (0.00% GC)
median time: 1.187 ms (0.00% GC)
mean time: 1.213 ms (0.00% GC)
maximum time: 2.265 ms (0.00% GC)
--------------
samples: 4121
evals/sample: 1
julia> Sys.cpu_summary()
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz:
speed user nice sys idle irq
#1-12 2600 MHz 27733057 s 0 s 14434951 s 900004048 s 0 s
julia> nthreads()
6
Is that 7% in your demo just noise? I'd be very interested to know if it's real.
> Is that 7% in your demo just noise? I'd be very interested to know if it's real.

I think so. I'll run a few more in-depth checks with thread-pinning, etc.
@mbauman, I ran it again with `$` interpolation in `@benchmark`. There doesn't seem to be any significant difference between using `@uviews` or not on Julia v1.5 :smile::
numactl -C 0-63 julia

using Base.Threads, LinearAlgebra
using UnsafeArrays
using BenchmarkTools

function colnorms!(dest::AbstractVector, A::AbstractMatrix)
    @threads for i in axes(A, 2)
        dest[i] = norm(view(A, :, i))
    end
    dest
end

colnorms_with_uviews!(dest, A) = @uviews A colnorms!(dest, A)

A = rand(50, 10^5);
dest = similar(A, size(A, 2));
colnorms!(dest, A)
colnorms_with_uviews!(dest, A)
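For context, a minimal sketch of what `@uviews` does, based on the UnsafeArrays.jl documentation (the variable names here are illustrative, not from the thread):

```julia
using UnsafeArrays

# Within the block, `@uviews` rebinds `A` to `uview(A)`, an
# `UnsafeArray` backed by the same memory (and GC-preserves the
# original). An `UnsafeArray` is a plain isbits struct, so views of it
# never touch the GC heap -- which is what made the difference in
# threaded code on Julia < 1.5.
A = rand(50, 10^3)
inner_type = Ref{Any}(nothing)
@uviews A begin
    inner_type[] = typeof(A)   # records an UnsafeArray type, not Matrix{Float64}
end
```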
julia> versioninfo()
Julia Version 1.5.0-beta1.0
Commit 6443f6c95a (2020-05-28 17:42 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD EPYC 7702P 64-Core Processor
julia> nthreads()
64
julia> @benchmark colnorms!($dest, $A)
BenchmarkTools.Trial:
memory estimate: 46.61 KiB
allocs estimate: 321
--------------
minimum time: 95.621 μs (0.00% GC)
median time: 110.751 μs (0.00% GC)
mean time: 121.822 μs (2.68% GC)
maximum time: 4.075 ms (91.13% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark colnorms_with_uviews!($dest, $A)
BenchmarkTools.Trial:
memory estimate: 46.63 KiB
allocs estimate: 321
--------------
minimum time: 89.120 μs (0.00% GC)
median time: 105.310 μs (0.00% GC)
mean time: 116.689 μs (2.70% GC)
maximum time: 4.001 ms (90.93% GC)
--------------
samples: 10000
evals/sample: 1
These numbers seem fairly stable when I run it multiple times. So a very small difference remains, but it can't really be due to memory allocation: with 64 threads, any difference in allocation frequency should result in a clear performance difference.
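One can check the allocation claim directly with a single-threaded spot check (my addition, not from the thread; `colnorm` is a hypothetical helper mirroring the loop body above):

```julia
using LinearAlgebra

# On Julia >= 1.5, the SubArray created by `view` can be allocated
# inline (JuliaLang/julia#34126), so the per-column kernel should
# report zero heap allocations once compiled; on v1.4 each call
# heap-allocated the view.
colnorm(A, i) = norm(view(A, :, i))

A = rand(50, 1000)
colnorm(A, 1)                     # compile first
bytes = @allocated colnorm(A, 1)
```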
And the pure absolute difference between Julia v1.4 and v1.5 (without UnsafeArrays), on 64 threads:
Julia v1.4:
julia> @benchmark colnorms!($dest, $A)
BenchmarkTools.Trial:
memory estimate: 4.62 MiB
allocs estimate: 100323
--------------
minimum time: 257.731 μs (0.00% GC)
median time: 617.504 μs (0.00% GC)
mean time: 9.535 ms (93.55% GC)
maximum time: 3.384 s (99.97% GC)
--------------
samples: 758
evals/sample: 1
Julia v1.5
julia> @benchmark colnorms!($dest, $A)
BenchmarkTools.Trial:
memory estimate: 46.61 KiB
allocs estimate: 321
--------------
minimum time: 95.621 μs (0.00% GC)
median time: 110.751 μs (0.00% GC)
mean time: 121.822 μs (2.68% GC)
maximum time: 4.075 ms (91.13% GC)
--------------
samples: 10000
evals/sample: 1
A mean time of 122 μs, vs. 9.5 ms before! My deepest thanks to the compiler team for this. I think JuliaLang/julia#34126 will be a big boost for heavily multi-threaded applications. After all, benchmark mean time is usually the number with the strongest influence on application wall-clock time.
Julia v1.5 enables inline allocation of structs with pointers (JuliaLang/julia#34126); this should make UnsafeArrays unnecessary in most cases. New benchmarks, using the test case:

With Julia v1.4:

With Julia v1.5-beta1:

Very little difference in the mean runtime with and without `@uviews`, in contrast to v1.4, where we see a strong difference. Also, a very nice gain in speed in general.

Test system: AMD EPYC 7702P 64-core CPU.