ParRes / Kernels

This is a set of simple programs that can be used to explore the features of a parallel platform.
https://groups.google.com/forum/#!forum/parallel-research-kernels

Use LoopVectorization in julia stencil / transpose #543

Open haampie opened 3 years ago

haampie commented 3 years ago

LoopVectorization.jl usually does a better job than the Julia compiler + LLVM at unrolling and vectorizing loops. You might want to use it for some of the benchmarks.

For instance on zen2:

$ ~/julia-1.6.0-rc1/bin/julia -O3

(@v1.6) pkg> activate --temp
  Activating new environment at `/tmp/jl_GYGsu9/Project.toml`

(jl_GYGsu9) pkg> add BenchmarkTools, LoopVectorization

julia> using LoopVectorization, BenchmarkTools

julia> r = 3;

julia> n = 1000;

julia> A = zeros(Float64, n, n);

julia> B = zeros(Float64, n, n);

julia> W = zeros(Float64, 2*r+1, 2*r+1);

julia> function do_stencil(A, W, B, r, n)
           for j=r:n-r-1
               for i=r:n-r-1
                   for jj=-r:r
                       for ii=-r:r
                           @inbounds B[i+1,j+1] += W[r+ii+1,r+jj+1] * A[i+ii+1,j+jj+1]
                       end
                   end
               end
           end
       end
do_stencil (generic function with 1 method)

julia> @benchmark do_stencil($A, $W, $B, $r, $n)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.744 ms (0.00% GC)
  median time:      24.799 ms (0.00% GC)
  mean time:        24.803 ms (0.00% GC)
  maximum time:     24.948 ms (0.00% GC)
  --------------
  samples:          202
  evals/sample:     1

julia> function do_stencil_avx(A, W, B, r, n)
           @avx for j=r:n-r-1, i=r:n-r-1, jj=-r:r, ii=-r:r
               B[i+1,j+1] += W[r+ii+1,r+jj+1] * A[i+ii+1,j+jj+1]
           end
       end
do_stencil_avx (generic function with 1 method)

julia> @benchmark do_stencil_avx($A, $W, $B, $r, $n)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.234 ms (0.00% GC)
  median time:      3.267 ms (0.00% GC)
  mean time:        3.275 ms (0.00% GC)
  maximum time:     3.452 ms (0.00% GC)
  --------------
  samples:          1527
  evals/sample:     1
haampie commented 3 years ago

See also: https://chriselrod.github.io/LoopVectorization.jl/stable/examples/filtering/

haampie commented 3 years ago

I tested it on an Intel Xeon Gold 6154 CPU @ 3.00GHz with AVX-512 too, and there the results are even more dramatic:

julia> @benchmark do_stencil($A, $W, $B, $r, $n)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     46.800 ms (0.00% GC)
  median time:      46.831 ms (0.00% GC)
  mean time:        46.843 ms (0.00% GC)
  maximum time:     47.040 ms (0.00% GC)
  --------------
  samples:          107
  evals/sample:     1

versus

julia> @benchmark do_stencil_avx($A, $W, $B, $r, $n)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.192 ms (0.00% GC)
  median time:      2.196 ms (0.00% GC)
  mean time:        2.197 ms (0.00% GC)
  maximum time:     2.443 ms (0.00% GC)
  --------------
  samples:          2275
  evals/sample:     1

that's a 21x speedup over the scalar version (due to contracting * and + into fma, unrolling, and using 512-bit registers for almost everything; I think it uses masked operations when the convolution kernel size isn't a nice multiple of the vector width, so in this case, with a 7x7 kernel, everything is probably masked ops).

jeffhammond commented 3 years ago

Why do I need to specify this? I hate making the PRK codes nonportable, especially in interpreted languages, where there is no excuse for such things to be necessary.

Is there a portable version like @simd? I've got an Apple M1 laptop now so @avx is going to break, is it not?

I can merge this but I'm curious what is wrong in the Julia implementation that they can't JIT AVX without such attributes.

chriselrod commented 3 years ago

Julia can SIMD just fine without @avx by relying on LLVM. The reason for macros like @simd is the same as for compiler flags like --ffast-math, which are often required for SIMD because floating-point arithmetic is not associative. Rather than applying such options globally, Julia gives you local control, so that some code can adhere strictly to IEEE semantics -- as is often important for things like special-function implementations -- while another part still autovectorizes reductions.
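The local control described above can be sketched as follows; `sum_simd` is a hypothetical name, not a function from the thread. The @simd annotation grants the compiler permission to reorder this one reduction, leaving IEEE semantics intact everywhere else:

```julia
# Hedged sketch: @simd permits reassociation of this one
# floating-point reduction so LLVM can vectorize it, without
# a global --ffast-math-style flag.
function sum_simd(x)
    s = 0.0
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end
```

Summing in a different order can change the rounded result, which is why Julia requires this opt-in per loop rather than assuming associativity.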

Regarding the @avx macro: it is defined by the Julia library LoopVectorization, which implements a loop optimizer separate from LLVM's. While it still has some limitations, it outperforms LLVM, GCC, and ICC in many benchmarks, e.g. convolutions like the stencil example.
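For the transpose kernel mentioned in the issue title, applying @avx is a one-line change; this is a hypothetical sketch (the function name and the B += A' accumulation are assumptions, not code from the PRK repo), and it requires LoopVectorization to be installed:

```julia
using LoopVectorization

# Hedged sketch of an @avx-annotated transpose-and-accumulate loop;
# LoopVectorization analyzes the loop nest and picks its own
# unrolling and vectorization strategy.
function do_transpose_avx!(B, A, n)
    @avx for j = 1:n, i = 1:n
        B[j, i] += A[i, j]
    end
end
```

The strided reads of A are the hard part for this kernel; chriselrod notes below that handling of non-contiguous memory accesses is still being improved, so gains here may be smaller than for the stencil.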

I'm also working on adding automatic threading support and better handling of non-contiguous memory accesses, so some of the benchmarks should notably improve soon.

The macro is called @avx, but people on ARM have reported that it works, as have people running the M1 + Rosetta. I only have personal access to x86_64 systems, and those are also the only ones available on CI, so that is what's best tested and what I've actually been able to tune it for. The name @simd was already taken, and I wanted a short acronym meaning vectorization, so I went with @avx.

Any cases of it not being portable are a bug. I'm looking forward to trying it out with SVE someday, for example.

haampie commented 3 years ago

I just checked on two non-x86 machines:

Fujitsu A64FX (not really supported by LLVM 11):

julia> @benchmark do_stencil($A, $W, $B, $r, $n)
  minimum time:     228.705 ms (0.00% GC)
  median time:      228.722 ms (0.00% GC)

julia> @benchmark do_stencil_avx($A, $W, $B, $r, $n)
  minimum time:     33.711 ms (0.00% GC)
  median time:      33.723 ms (0.00% GC)

Seems like it still unrolls and generates fma instructions, but does not vectorize.

And POWER8:

julia> @benchmark do_stencil($A, $W, $B, $r, $n)
  minimum time:     78.107 ms (0.00% GC)
  median time:      78.187 ms (0.00% GC)

julia> @benchmark do_stencil_avx($A, $W, $B, $r, $n)
  minimum time:     11.181 ms (0.00% GC)
  median time:      11.373 ms (0.00% GC)

which seems to vectorize, but I have a hard time getting it to show the generated assembly.

jeffhammond commented 3 years ago

Okay, if it's portable I'll merge it as soon as I can test locally. I'm currently without power or heat in Oregon, so it might be a day or so.

jeffhammond commented 3 years ago

@haampie Is it possible for you to make a pull request for this?