JuliaArrays / UnsafeArrays.jl

Stack-allocated pointer-based array views

Benchmarks with Julia v1.5 #8

Open oschulz opened 4 years ago

oschulz commented 4 years ago

Julia v1.5 enables inline allocation of structs with pointers (JuliaLang/julia#34126), which should make UnsafeArrays unnecessary in most cases. New benchmarks, using the test case:

using Base.Threads, LinearAlgebra
using UnsafeArrays
using BenchmarkTools

# Compute the norm of each column of A, distributing columns over threads
function colnorms!(dest::AbstractVector, A::AbstractMatrix)
    @threads for i in axes(A, 2)
        dest[i] = norm(view(A, :, i))
    end
    dest
end

A = rand(50, 10^5);
dest = similar(A, size(A, 2));

colnorms!(dest, A)
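
As a quick way to see the effect of the inline-allocation change itself, one can check how many bytes a single view-plus-norm call allocates (a minimal sketch using only Base and LinearAlgebra; view_norm is just an illustrative helper, not part of the benchmark above):

using LinearAlgebra

# Take a view of one column and compute its norm. On Julia v1.4 the view is
# heap-allocated; on v1.5 it should be allocated inline and never hit the heap.
view_norm(A) = norm(view(A, :, 1))

A = rand(50, 10^5)
view_norm(A)             # warm-up call, so compilation is not measured
@allocated view_norm(A)  # expected: > 0 bytes on v1.4, 0 bytes on v1.5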

With Julia v1.4:

julia> nthreads()
64

julia> @benchmark colnorms!(dest, A)
BenchmarkTools.Trial: 
  memory estimate:  4.62 MiB
  allocs estimate:  100323
  --------------
  minimum time:     256.291 μs (0.00% GC)
  median time:      623.428 μs (0.00% GC)
  mean time:        10.020 ms (93.82% GC)
  maximum time:     3.567 s (99.97% GC)
  --------------
  samples:          758
  evals/sample:     1

julia> @benchmark @uviews A colnorms!(dest, A)
BenchmarkTools.Trial: 
  memory estimate:  45.63 KiB
  allocs estimate:  324
  --------------
  minimum time:     227.121 μs (0.00% GC)
  median time:      249.831 μs (0.00% GC)
  mean time:        262.351 μs (1.26% GC)
  maximum time:     4.043 ms (85.49% GC)
  --------------
  samples:          10000
  evals/sample:     1

With Julia v1.5-beta1:

julia> nthreads()
64

julia> @benchmark colnorms!(dest, A)
BenchmarkTools.Trial: 
  memory estimate:  46.61 KiB
  allocs estimate:  321
  --------------
  minimum time:     135.311 μs (0.00% GC)
  median time:      156.681 μs (0.00% GC)
  mean time:        166.511 μs (2.80% GC)
  maximum time:     5.915 ms (89.80% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark @uviews A colnorms!(dest, A)
BenchmarkTools.Trial: 
  memory estimate:  46.66 KiB
  allocs estimate:  322
  --------------
  minimum time:     126.701 μs (0.00% GC)
  median time:      140.041 μs (0.00% GC)
  mean time:        150.547 μs (2.48% GC)
  maximum time:     5.952 ms (90.35% GC)
  --------------
  samples:          10000
  evals/sample:     1

Very little difference in mean runtime with and without @uviews, in contrast to v1.4, where the difference is large. There is also a very nice gain in speed overall.

Test system: AMD EPYC 7702P 64-core CPU.

mbauman commented 4 years ago

I was a little disappointed to still see that 7% difference there, but I was unable to reproduce it on my 6-core Skylake laptop:

julia> @benchmark colnorms!($dest, $A)
BenchmarkTools.Trial:
  memory estimate:  4.16 KiB
  allocs estimate:  31
  --------------
  minimum time:     1.138 ms (0.00% GC)
  median time:      1.191 ms (0.00% GC)
  mean time:        1.213 ms (0.00% GC)
  maximum time:     1.754 ms (0.00% GC)
  --------------
  samples:          4119
  evals/sample:     1

julia> @benchmark @uviews $A colnorms!($dest, $A)
BenchmarkTools.Trial:
  memory estimate:  4.16 KiB
  allocs estimate:  31
  --------------
  minimum time:     1.136 ms (0.00% GC)
  median time:      1.187 ms (0.00% GC)
  mean time:        1.213 ms (0.00% GC)
  maximum time:     2.265 ms (0.00% GC)
  --------------
  samples:          4121
  evals/sample:     1

julia> Sys.cpu_summary()
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz:
          speed         user         nice          sys         idle          irq
#1-12  2600 MHz   27733057 s          0 s   14434951 s  900004048 s          0 s

julia> nthreads()
6

Is that 7% in your demo just noise? I'd be very interested to know if it's real.
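
One way to quantify whether such a gap is real rather than noise is BenchmarkTools' comparison helpers; a minimal sketch, assuming the setup from the benchmarks above (the trial variable names are placeholders):

using BenchmarkTools

# Keep the two trials instead of only printing them:
t_plain  = @benchmark colnorms!($dest, $A)
t_uviews = @benchmark @uviews $A colnorms!($dest, $A)

# Compare the minimum-time estimates; with the default 5% time tolerance,
# a verdict of :invariant means the gap is indistinguishable from noise.
judge(minimum(t_uviews), minimum(t_plain))

# Or look at the raw ratio of the two estimates:
ratio(minimum(t_uviews), minimum(t_plain))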

oschulz commented 4 years ago

Is that 7% in your demo just noise? I'd be very interested to know if it's real.

I think so; I'll run a few more in-depth checks with thread pinning, etc.

oschulz commented 4 years ago

@mbauman, I ran it again with $ interpolation in @benchmark. There doesn't seem to be any significant difference between using @uviews or not on Julia v1.5 :smile::

# Julia started with: numactl -C 0-63 julia
using Base.Threads, LinearAlgebra
using UnsafeArrays
using BenchmarkTools

function colnorms!(dest::AbstractVector, A::AbstractMatrix)
    @threads for i in axes(A, 2)
        dest[i] = norm(view(A, :, i))
    end
    dest
end

colnorms_with_uviews!(dest, A) = @uviews A colnorms!(dest, A)

A = rand(50, 10^5);
dest = similar(A, size(A, 2));

colnorms!(dest, A)
colnorms_with_uviews!(dest, A)

julia> versioninfo()
Julia Version 1.5.0-beta1.0
Commit 6443f6c95a (2020-05-28 17:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7702P 64-Core Processor

julia> nthreads()
64

julia> @benchmark colnorms!($dest, $A)
BenchmarkTools.Trial: 
  memory estimate:  46.61 KiB
  allocs estimate:  321
  --------------
  minimum time:     95.621 μs (0.00% GC)
  median time:      110.751 μs (0.00% GC)
  mean time:        121.822 μs (2.68% GC)
  maximum time:     4.075 ms (91.13% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark colnorms_with_uviews!($dest, $A)
BenchmarkTools.Trial: 
  memory estimate:  46.63 KiB
  allocs estimate:  321
  --------------
  minimum time:     89.120 μs (0.00% GC)
  median time:      105.310 μs (0.00% GC)
  mean time:        116.689 μs (2.70% GC)
  maximum time:     4.001 ms (90.93% GC)
  --------------
  samples:          10000
  evals/sample:     1

These numbers are fairly stable when I run this multiple times. So a very small difference remains, but it can't really be due to memory allocation: with 64 threads, any difference in allocation frequency should show up as a clear performance difference.
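
To double-check that allocation really is identical for the two variants, one could also compare the bytes allocated per call directly (a sketch, using the definitions from the snippet above):

# One warm-up call each, so compilation is not counted:
colnorms!(dest, A); colnorms_with_uviews!(dest, A)

@allocated colnorms!(dest, A)
@allocated colnorms_with_uviews!(dest, A)

If both report essentially the same number (the roughly 46 KiB shown above, presumably task overhead from @threads), the residual timing gap is not an allocation effect.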

oschulz commented 4 years ago

And for the absolute difference between Julia v1.4 and v1.5 (without UnsafeArrays), on 64 threads:

Julia v1.4:

julia> @benchmark colnorms!($dest, $A)
BenchmarkTools.Trial: 
  memory estimate:  4.62 MiB
  allocs estimate:  100323
  --------------
  minimum time:     257.731 μs (0.00% GC)
  median time:      617.504 μs (0.00% GC)
  mean time:        9.535 ms (93.55% GC)
  maximum time:     3.384 s (99.97% GC)
  --------------
  samples:          758
  evals/sample:     1

Julia v1.5

julia> @benchmark colnorms!($dest, $A)
BenchmarkTools.Trial: 
  memory estimate:  46.61 KiB
  allocs estimate:  321
  --------------
  minimum time:     95.621 μs (0.00% GC)
  median time:      110.751 μs (0.00% GC)
  mean time:        121.822 μs (2.68% GC)
  maximum time:     4.075 ms (91.13% GC)
  --------------
  samples:          10000
  evals/sample:     1

A mean time of 122 μs vs. 9.5 ms before! My deepest thanks to the compiler team for this. I think JuliaLang/julia#34126 will give heavily multi-threaded applications a big boost.

After all, benchmark mean time is usually the number with the strongest influence on application wall-clock time.
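
As a rough illustration, extrapolating the mean times above to many repeated calls (a back-of-the-envelope sketch):

# Wall-clock time for N calls is roughly N * mean time, so the GC-dominated
# mean matters far more for throughput than the median.
N = 10_000
N * 9.5e-3       # Julia v1.4: ≈ 95 s
N * 122e-6       # Julia v1.5: ≈ 1.2 s
9.5e-3 / 122e-6  # ≈ 78× lower mean time per call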