ParRes / Kernels

This is a set of simple programs that can be used to explore the features of a parallel platform.
https://groups.google.com/forum/#!forum/parallel-research-kernels

Small Julia fixes #555

Closed miguelraz closed 3 years ago

miguelraz commented 3 years ago

Do you certify that your contribution is made in good faith and does not attempt to introduce any negative behavior into this project?

I just saw this thread on twitter, so I might as well chip in a bit.

Changes:

  1. do_add -> do_add! : When a function mutates its arguments (A in this case), the Julia convention is to append a ! to the function name to alert the user.
  2. Switching the loop order: Julia arrays are column major, so the original order (column index varying in the inner loop) incurs a heavy performance hit; with the row index innermost, memory is accessed contiguously.
  3. 1:n -> Base.OneTo(n): This uses a range type that tells the compiler the loop starts at 1, which can help with loop unrolling and SIMD shenanigans. Usually not worth it outside a very tight loop.
  4. @inbounds elides bounds checking.
  5. 1.0 to one(eltype(A)): This calls a function that returns the multiplicative identity for the element type of A, which helps guarantee type stability of the Julia code. It matters especially for floating point operations, since the literal 1.0 is a Float64, so every addition gets promoted to Float64 even when a Float32 array is passed in. The timings below should make the difference very clear, and a short illustration follows this list.
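As a quick illustration of point 5 (a minimal sketch, not part of the PR itself): one(eltype(A)) produces the constant in the array's own element type, so a Float32 array never has to touch Float64 arithmetic.

julia> one(eltype(rand(Float32, 2, 2)))
1.0f0

julia> one(eltype(rand(Float64, 2, 2)))
1.0
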
# Alternative broadcasting syntax - not the fastest, but compact and nifty
julia> using BenchmarkTools

julia> @btime A .+= 1.0;
   671.889 μs (2 allocations: 64 bytes)

julia> function do_add(A, n)
           for i=1:n
               for j=1:n
                   A[i,j] += 1.0
               end
           end
       end
do_add (generic function with 1 method)

julia> A = [rand() for i in 1:1000, j in 1:1000];

julia> Afloat32 = [rand(Float32) for i in 1:1000, j in 1:1000];

julia> @btime do_add($A, 1000);
  2.720 ms (0 allocations: 0 bytes)

# We should be getting a much larger perf boost from Float32's
julia> @btime do_add($Afloat32, 1000);
  1.970 ms (0 allocations: 0 bytes)

# Switching the loops
julia> function do_add2!(A, n)
           for j=1:n
               for i=1:n
                   A[i,j] += 1.0
               end
           end
       end
do_add2! (generic function with 1 method)

# About a 3-4x gain
julia> @btime do_add2!($A, 1000);
  678.599 μs (0 allocations: 0 bytes)

julia> function do_add3!(A, n)
           for j=1:n
               for i=1:n
                   @inbounds A[i,j] += 1.0
               end
           end
       end
do_add3! (generic function with 1 method)

julia> @btime do_add3!($A, 1000);
  605.119 μs (0 allocations: 0 bytes)

# This is equivalent to the one in the PR
julia> function do_add4!(A)
           n = size(A, 1)
           for j = Base.OneTo(n)
               for i = Base.OneTo(n)
                   @inbounds A[i,j] += one(eltype(A))
               end
           end
       end
do_add4! (generic function with 1 method)

julia> @btime do_add4!($A);
  499.562 μs (0 allocations: 0 bytes)

# PR version
julia> @btime do_add4!($Afloat32);
  181.395 μs (0 allocations: 0 bytes)

TL;DR - I highly recommend reading the performance tips section of the manual; almost everything I did here is written there. We gained about 15x, and the code is now type stable and generic - you can probably get it to run on GPUs without much additional effort, but that's a PR for a different day.
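As a rough sketch of the GPU remark (an assumption on my part, not tested as part of this PR; it presumes CUDA.jl and a working GPU), the broadcast form carries over directly:

julia> using CUDA

julia> dA = CuArray(Afloat32);

julia> dA .+= one(eltype(dA));   # broadcast compiles to a GPU kernel and stays Float32
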

My laptop specs:

julia> versioninfo()
Julia Version 1.6.0-rc2
Commit 4b6b9fe4d7 (2021-03-11 07:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, haswell)
miguelraz commented 3 years ago

After posting some questions in the Julia Slack #linearalgebra chat, Mason Protter and Chris Elrod kindly suggested some more idiomatic improvements, and the use of LoopVectorization.jl for AVX + multithreaded speedups.

@chriselrod's code:

]add LoopVectorization@0.12

julia> using LoopVectorization

julia> @btime @avxt $Afloat32 .+= 1.0;
  5.955 μs (0 allocations: 0 bytes)

julia> @btime @avxt $Afloat32 .+= 1f0;
  6.063 μs (0 allocations: 0 bytes)

julia> @btime @avx thread=4 $Afloat32 .+= 1f0;
  97.759 μs (0 allocations: 0 bytes)

julia> @btime @avx thread=false $Afloat32 .+= 1f0;
  121.804 μs (0 allocations: 0 bytes)

With the following spec:

julia> versioninfo()
Julia Version 1.7.0-DEV.707
Commit d1d0646390* (2021-03-14 18:11 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36
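
For completeness, here is a loop form of the same kernel with LoopVectorization (a sketch under the same setup, not benchmarked here; @avx is the macro name in the 0.12 series, later renamed @turbo):

julia> using LoopVectorization

julia> function do_add5!(A)
           o = one(eltype(A))            # hoist the typed constant out of the loop
           @avx for j in axes(A, 2), i in axes(A, 1)
               A[i, j] += o
           end
       end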
jeffhammond commented 3 years ago

Thanks for the contributions. I am overdue on improving the Julia code here and am very happy to have more experienced people help with it.

This PR doesn't compile for me. Can you fix it?

JULIA % julia stencil.jl 10 4000 star 4
Parallel Research Kernels version 
Julia stencil execution on 2D grid
Grid size            = 4000
Radius of stencil    = 4
Type of stencil      = star
Data type            = double precision
Compact representation of stencil loop body
Number of iterations = 10
ERROR: LoadError: MethodError: no method matching +(::Float64, ::Matrix{Float64})
For element-wise addition, use broadcasting with dot syntax: scalar .+ array
Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...) at operators.jl:560
  +(::Union{Float16, Float32, Float64}, ::BigFloat) at mpfr.jl:392
  +(::Array, ::Array...) at arraymath.jl:43
  ...
Stacktrace:
 [1] do_add!
   @ ~/Work/PRK/JULIA/stencil.jl:67 [inlined]
 [2] main()
   @ Main ~/Work/PRK/JULIA/stencil.jl:212
 [3] top-level scope
   @ ~/Work/PRK/JULIA/stencil.jl:246
miguelraz commented 3 years ago

Ah, tried to be too cheeky with a function call. That should do it.
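
For context on the MethodError above (a guess at the offending call, not taken from the actual diff): on a square Matrix, one(A) is the multiplicative identity, i.e. the identity matrix, whereas one(eltype(A)) is the scalar one of the element type, which is what the loop body needs.

julia> B = rand(3, 3);

julia> one(B)            # identity Matrix{Float64}, not a scalar
3×3 Matrix{Float64}:
 1.0  0.0  0.0
 0.0  1.0  0.0
 0.0  0.0  1.0

julia> one(eltype(B))    # scalar one of the element type
1.0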