JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License

Strided Loops with @avx ? #132

Closed LaurentPlagne closed 3 years ago

LaurentPlagne commented 4 years ago

Hi, I wonder if @avx can be used with strided loops like

function smooth_line(nrm1,j,i1,sl,rl,ih2,denom)
    @fastmath @inbounds @simd for i=i1:2:nrm1
            sl[i,j]=denom*(rl[i,j]+ih2*(sl[i,j-1]+sl[i-1,j]+sl[i+1,j]+sl[i,j+1]))
    end
end

It seems that replacing @fastmath @inbounds @simd with @avx does not produce the same result.

chriselrod commented 4 years ago

Hi, thanks for the issue. There are two problems with this loop currently:

  1. I need to add support for strided loops. I'd have to think more about the best way to handle this. One possible solution is to forbid vectorizing them, but it would probably be better to allow it as a possibility. One idea is to treat each accessed array as having an extra leading dimension whose length equals the stride. That is, for n in 1:step:N; x[n] would translate to for n in 1:div(N,step); x[1,n], where x is treated as having length step in the first dimension (see the sketch after this list).
  2. Loop carried dependencies aren't supported yet.
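
To illustrate the reindexing idea in point 1, here is a plain-Julia sketch (nothing LoopVectorization-specific; the names are made up):

# Rewrite a strided loop as a unit-index loop over a reshaped view of the same memory.
N, step = 12, 2
x = collect(1.0:N)
for n in 1:step:N
    x[n] *= 2                       # strided access
end

y = collect(1.0:N)
y2 = reshape(y, step, N ÷ step)     # shares memory; y2[1,n] aliases y[step*(n-1)+1]
for n in 1:size(y2, 2)
    y2[1, n] *= 2                   # the stride is absorbed into the first dimension
end
x == y                              # true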
LaurentPlagne commented 4 years ago

Hi Chris, thanks for the reply.

  1. I tried this solution, and @avx produces code that is about twice as slow as @simd:

    @inline function smooth_line_avx(nrm1,j,i1,sl,rl,ih2,denom)
        ni2=div(nrm1,2)-1
        @avx for i2=0:ni2
            sl[i1+2i2,j]=denom*(rl[i1+2i2,j]+ih2*(sl[i1+2i2,j-1]+sl[i1+2i2-1,j]+sl[i1+2i2+1,j]+sl[i1+2i2,j+1]))
        end
    end
  2. I think there should be no dependency in this loop (it is part of a red-black Gauss-Seidel).

I think that the issue should be closed but, if it is possible (I don't know if it is), it would be nice to have a warning message saying that strided loops are not supported by @avx (or at least a note in the README).

BTW: are nested loops handled by @avx?

chriselrod commented 4 years ago

Mind sharing code I can copy and paste to run the functions?

I tried this solution, and @avx produces code that is about twice as slow as @simd

Does @simd actually vectorize? I assume it does not, and that not vectorizing is faster, at least with AVX2. I'm curious how they compare on AVX512, where gather/scatter instructions at least exist, but they are slow there, so I'm not really expecting this to be better. If you could reorganize the data, that would help a lot.
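
One way to check (a generic sketch, not output from this thread; it assumes the smooth_line definition above):

using InteractiveUtils   # for @code_llvm outside the REPL

sl = rand(130, 130); rl = rand(130, 130);
@code_llvm debuginfo=:none smooth_line(129, 2, 2, sl, rl, 0.5, 0.25)
# Look for vector types like <4 x double> (or for gather/scatter instructions
# with @code_native) to see whether the @simd loop was actually vectorized.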

I think there should be no dependency in this loop (it is part of a red-black Gauss-Seidel).

Ha ha, I wasn't thinking of the stride. You're right, they're independent.

I think that the issue should be closed but, if it is possible (I don't know if it is), it would be nice to have a warning message saying that strided loops are not supported by @avx (or at least a note in the README).

I'll add a check that results in generic fall-back behavior, and add a note to the documentation.

BTW: are nested loops handled by @avx?

Yes.
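
For example, @avx handles a whole nest like this one (a generic sketch, not code from this issue):

using LoopVectorization

function mygemm_avx!(C, A, B)
    # @avx covers the full nest, including the inner reduction over k.
    @avx for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
end

A = rand(64, 64); B = rand(64, 64); C = similar(A);
mygemm_avx!(C, A, B)
C ≈ A * B   # true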

LaurentPlagne commented 4 years ago

For information, I want to perform a 2D Gauss-Seidel. In order to obtain some parallelism, I changed this to a red-black GS, where even cells ((i+j)%2==0) can be updated (independently) from the odd ones in a first step, and the odd cells from the even ones in a second step. The following should be runnable.

using BenchmarkTools
using LoopVectorization
using LinearAlgebra

#Original functions (smooth_line and smooth_seq using @simd)
@inline function smooth_line(nrm1,j,i1,sl,rl,ih2,denom)
    @fastmath @inbounds @simd for i=i1:2:nrm1
            sl[i,j]=denom*(rl[i,j]+ih2*(sl[i,j-1]+sl[i-1,j]+sl[i+1,j]+sl[i,j+1]))
    end
end

@inline function smooth_seq(nrows,ncols,sl,rl,ih2)
    denom=1/(4ih2)
    nrm1=nrows-1

    @inbounds for j=2:2:ncols-1
        smooth_line(nrm1,j,2,sl,rl,ih2,denom)
        smooth_line(nrm1,j+1,3,sl,rl,ih2,denom)
    end
    @inbounds for j=2:2:ncols-1
        smooth_line(nrm1,j,3,sl,rl,ih2,denom)
        smooth_line(nrm1,j+1,2,sl,rl,ih2,denom)
    end

end

#@avx'ed functions (smooth_line_avx and smooth_seq_avx using @avx)

@inline function smooth_line_avx(nrm1,j,i1,sl,rl,ih2,denom)
    #The trick to avoid a strided loop and use @avx
    ni2=div(nrm1,2)-1

    # @inbounds @simd for i2=0:ni2
    @avx for i2=0:ni2
        sl[i1+2i2,j]=denom*(rl[i1+2i2,j]+ih2*(sl[i1+2i2,j-1]+sl[i1+2i2-1,j]+sl[i1+2i2+1,j]+sl[i1+2i2,j+1]))
    end
end

@inline function smooth_seq_avx(nrows,ncols,sl,rl,ih2)
    denom=1/(4ih2)
    nrm1=nrows-1

    @inbounds for j=2:2:ncols-1
        smooth_line_avx(nrm1,j,2,sl,rl,ih2,denom)
        smooth_line_avx(nrm1,j+1,3,sl,rl,ih2,denom)
    end
    @inbounds for j=2:2:ncols-1
        smooth_line_avx(nrm1,j,3,sl,rl,ih2,denom)
        smooth_line_avx(nrm1,j+1,2,sl,rl,ih2,denom)
    end

end

#function that calls the benchmarks
function go(n)
    # initialize two 2D arrays
    sl=rand(n+2,n+2)
    rl=rand(n+2,n+2)
    # the following block checks that smooth_seq and smooth_seq_avx
    # return the same result 
    sl_ref=deepcopy(sl)
    (nrows,ncols)=size(sl)
    ih2=rand()
    smooth_seq(nrows,ncols,sl_ref,rl,ih2)
    smooth_seq_avx(nrows,ncols,sl,rl,ih2)
    @show norm(sl-sl_ref)
    #The actual timings
    tsimd=@belapsed smooth_seq($nrows,$ncols,$sl,$rl,$ih2)
    tavx=@belapsed smooth_seq_avx($nrows,$ncols,$sl,$rl,$ih2)
    @show tsimd,tavx
    tsimd,tavx
end

@show go(128)
@show go(256)

I could easily change the layout, but I don't know what would be optimal... I tried to put @avx on the outer loop, but this failed.

chriselrod commented 4 years ago

FWIW, they're about equally fast on my computer:

julia> @show go(128)
norm(sl - sl_ref) = 2.1216300999233956e-14
(tsimd, tavx) = (2.2724e-5, 2.1082e-5)
go(128) = (2.2724e-5, 2.1082e-5)
(2.2724e-5, 2.1082e-5)

julia> @show go(256)
norm(sl - sl_ref) = 1.4344236152357477e-14
(tsimd, tavx) = (8.4377e-5, 8.5123e-5)
go(256) = (8.4377e-5, 8.5123e-5)
(8.4377e-5, 8.5123e-5)

julia> @show go(512)
norm(sl - sl_ref) = 2.4506837318300105e-14
(tsimd, tavx) = (0.000407476, 0.000401408)
go(512) = (0.000407476, 0.000401408)
(0.000407476, 0.000401408)

julia> @show go(510)
norm(sl - sl_ref) = 2.5067239002253036e-12
(tsimd, tavx) = (0.000376414, 0.000371184)
go(510) = (0.000376414, 0.000371184)
(0.000376414, 0.000371184)

julia> @show go(514)
norm(sl - sl_ref) = 2.4883860367553863e-14
(tsimd, tavx) = (0.000376636, 0.000395572)
go(514) = (0.000376636, 0.000395572)
(0.000376636, 0.000395572)

Using LinuxPerf#master with n = 120:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults)" foreachn(smooth_seq, 100_000, nrows,ncols,sl_ref,rl,ih2)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               8.04e+09  100.0%  #  3.6 cycles per ns
┌ instructions             2.03e+10  100.0%  #  2.5 insns per cycle
│ branch-instructions      2.25e+09  100.0%  # 11.1% of instructions
└ branch-misses            8.49e+05  100.0%  #  0.0% of branch instructions
┌ task-clock               2.24e+09  100.0%  #  2.2 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults)" foreachn(smooth_seq_avx, 100_000, nrows,ncols,sl_ref,rl,ih2)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               7.44e+09  100.0%  #  3.6 cycles per ns
┌ instructions             7.01e+09  100.0%  #  0.9 insns per cycle
│ branch-instructions      2.59e+08  100.0%  #  3.7% of instructions
└ branch-misses            3.64e+05  100.0%  #  0.1% of branch instructions
┌ task-clock               2.07e+09  100.0%  #  2.1 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Note that the base version had 2.5 instructions/cycle, while the @avx version had only 0.9, yet they took roughly the same amount of time.

While the base version does show SIMD instructions (including gathers and scatters), I'd have to look a lot more closely to see whether that code path is actually executed. I'm guessing that there's loop versioning involved, and some runtime checks determine whether or not the SIMD version runs. While I'm not sure, the large difference in instructions/cycle but similar performance suggests that is the case. EDIT: the foreachn helper used above is defined as:

foreachn(f::F, N, args::Vararg{<:Any,Nargs}) where {F,Nargs} = foreach(_ -> f(args...), 1:N)

I tried to put @avx on the outer loop, but this failed

It should be applied to loops whose bodies consist of elementary operations.

chriselrod commented 4 years ago

@LaurentPlagne You keep tagging a GitHub user named @avx. They may prefer not to get all these notifications about our discussion ;).

I could easily change the layout but I don't know what would be optimal...

Try making the even and odd numbers contiguous. For example:

function smooth_line(nrm1,j,i1,sl,rl,ih2,denom)
    @fastmath @inbounds @simd for i=i1:2:nrm1
            sl[i,j]=denom*(rl[i,j]+ih2*(sl[i,j-1]+sl[i-1,j]+sl[i+1,j]+sl[i,j+1]))
    end
end
function smooth_line_dim3_even(nrm1,j,sl,rl,ih2,denom)
    @fastmath @inbounds @simd for i= 1:nrm1 >>> 1
            sl[i,2,j]=denom*(rl[i,2,j]+ih2*(sl[i,2,j-1]+sl[i,1,j]+sl[i+1,1,j]+sl[i,2,j+1]))
    end
end
function smooth_line_dim3_odd(nrm1,j,sl,rl,ih2,denom)
    @fastmath @inbounds @simd for i = 2:nrm1 >>> 1
            sl[i,1,j]=denom*(rl[i,1,j]+ih2*(sl[i,1,j-1]+sl[i-1,2,j]+sl[i,2,j]+sl[i,1,j+1]))
    end
end
nrm1 = 128;
number_rows = nrm1 + 1 + iseven(nrm1); # make it divisible by 2
sl = rand(number_rows,4);
rl = rand(number_rows,4);
ih2 = rand(); denom = rand();
slnew = permutedims(reshape(sl, (2, size(sl,1) ÷ 2, size(sl,2))), (2,1,3));
rlnew = permutedims(reshape(rl, (2, size(sl,1) ÷ 2, size(sl,2))), (2,1,3));

smooth_line(nrm1,2,2,sl,rl,ih2,denom)
smooth_line(nrm1,3,3,sl,rl,ih2,denom)

smooth_line_dim3_even(nrm1,2,slnew,rlnew,ih2,denom)
smooth_line_dim3_odd(nrm1,3,slnew,rlnew,ih2,denom)

slcomp = permutedims(reshape(sl, (2, size(sl,1) ÷ 2, size(sl,2))), (2,1,3));
slnew ≈ slcomp # true

@benchmark smooth_line($nrm1,2,2,$sl,$rl,$ih2,$denom)
@benchmark smooth_line_dim3_even($nrm1,2,$slnew,$rlnew,$ih2,$denom)
@benchmark smooth_line_dim3_odd($nrm1,3,$slnew,$rlnew,$ih2,$denom)

smooth_line_dim3_* are much faster on my computer, and @avx is a smidge faster there than @fastmath @inbounds @simd.

LaurentPlagne commented 4 years ago

Wow!!! dim3_even/odd are indeed much faster on my machine. With @simd:

(n, t_smooth_line, t_smooth_line_dim3_even, t_mooth_line_dim3_odd) = (128, 7.164168377823409e-8, 2.638894472361809e-8, 2.930251256281407e-8)
(n, t_smooth_line, t_smooth_line_dim3_even, t_mooth_line_dim3_odd) = (256, 1.3984382284382283e-7, 4.398686868686869e-8, 4.9574468085106385e-8)
(n, t_smooth_line, t_smooth_line_dim3_even, t_mooth_line_dim3_odd) = (512, 2.413283950617284e-7, 8.683245521601686e-8, 1.0268404255319149e-7)

With @avx:

(n, t_smooth_line, t_smooth_line_dim3_even, t_mooth_line_dim3_odd) = (128, 6.286530612244897e-8, 2.453714859437751e-8, 3.226458752515091e-8)
(n, t_smooth_line, t_smooth_line_dim3_even, t_mooth_line_dim3_odd) = (256, 1.2445434298440981e-7, 4.064076690211907e-8, 4.8659919028340086e-8)
(n, t_smooth_line, t_smooth_line_dim3_even, t_mooth_line_dim3_odd) = (512, 2.415175879396985e-7, 7.580020597322349e-8, 9.708236536430835e-8)

I have to think more about this data layout (I need to make some drawings ;)

This reminds me of a paper I wrote about automatic data-layout adaptation in a C++ library (https://git.triscale-innov.com/Triscale-public/Legolas_pp). The principle is to automatically vectorize (and parallelize) embarrassingly parallel algorithms on nd-arrays by introducing a (hidden) extra dimension in the data layout of the user arrays. I still wonder if the same could be done in Julia (not sure). I hope we can discuss this one day.

Anyway: Thank you very much for your help !

P.S. I apologize to atavx !

chriselrod commented 4 years ago

What I did is sort of similar to slide 16/38 (page 27). Because we want to SIMD across several even (or odd) elements at a time, I reshaped the arrays so that those elements are contiguous rather than interleaved.
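
In miniature (a toy illustration, not code from this thread):

x = collect(1:8)                    # odd- and even-indexed entries interleaved
y = permutedims(reshape(x, 2, 4))   # 4×2 array
y[:, 1]                             # [1, 3, 5, 7]: odd-indexed entries, now contiguous
y[:, 2]                             # [2, 4, 6, 8]: even-indexed entries, now contiguous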

I still wonder if the same could be done in Julia (not sure). I hope we can discuss this one day.

I'm very interested in this as well, and was planning on working on this eventually in a DSL for Bayesian model fitting. What would be needed is both a way to estimate the computational cost of different layouts, and the ability to iterate across many different alternatives to pick the best.

A DSL would be an easier place to get started / do a proof of concept. The analysis should be made modular, so that eventually it can be used by code (e.g. macros) meant to optimize more general Julia functions.

I think it's an interesting question with lots of potential performance benefits.

LaurentPlagne commented 4 years ago

I'm very interested in this as well, and was planning on working on this eventually in a DSL for Bayesian model fitting. What would be needed is both a way to estimate the computational cost of different layouts, and the ability to iterate across many different alternatives to pick the best.

In the paper's case, the optimal layout was easy to find because it deals specifically with applying the same arbitrarily complex algorithm to N independent problems. So basically, we replace scalars at the lowest level by SIMD packs of scalars, and use multithreading at the highest level on groups of packed problems. The optimal packing factor was usually 2x or 4x the SIMD width.
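
Roughly, the packing looks like this (an illustrative sketch with made-up names, not the paper's actual code):

# Pack N independent problems of size n so that groups of `pack` problems are
# contiguous: SIMD runs over the lanes l, multithreading over the groups g.
N, n, pack = 64, 100, 16                    # pack would be 2x or 4x the SIMD width
data = rand(N, n)                           # data[p, i]: variable i of problem p
packed = reshape(data, pack, N ÷ pack, n)   # packed[l, g, i], lanes contiguous in memory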

Ideally, I could give the presentation for you (20 min) and we could then discuss what is possible in Julia. Let me know when you would like to set up a call.

I rewrote the previous problem using a struct that handles the RB layout, in order to introduce it into the multigrid algorithm step by step. On my machine the speed-up with @avx is close to 3!

using BenchmarkTools
using LoopVectorization
using LinearAlgebra

#Original functions (smooth_line and smooth_seq using @simd)
@inline function smooth_line(nrm1,j,i1,sl,rl,ih2,denom)
    @fastmath @inbounds @simd for i=i1:2:nrm1
            # @show i,j,denom,sl[i,j],rl[i,j],sl[i,j-1],sl[i,j+1],sl[i-1,j],sl[i+1,j]
            sl[i,j]=denom*(rl[i,j]+ih2*(sl[i,j-1]+sl[i-1,j]+sl[i+1,j]+sl[i,j+1]))
    end
end

@inline function smooth_seq(nrows,ncols,sl,rl,ih2)
    denom=1/(4ih2)
    nrm1=nrows-1
    #even (i+j)%2==0 from odds
    @inbounds for j=2:2:ncols-1
        smooth_line(nrm1,j,2,sl,rl,ih2,denom)
        smooth_line(nrm1,j+1,3,sl,rl,ih2,denom)
    end
    # #odds (i+j)%2==1 from evens
    @inbounds for j=2:2:ncols-1
        smooth_line(nrm1,j,3,sl,rl,ih2,denom)
        smooth_line(nrm1,j+1,2,sl,rl,ih2,denom)
    end

end

#RBArray is a type acting like a 2D array with a data layout suited to
# Red-Black Gauss-Seidel algorithms:
#  contiguous red (iseven(i+j)) values are stored in [i2,1,j]
#  contiguous black (isodd(i+j)) values are stored in [i2,2,j]
struct RBArray
    data::Array{Float64,3}
end
#Ctor : should use an inner Ctor to ensure that nx is even
#Note that >>> means unsigned >>. Hence x >>> 1 returns div(x,2) if x is a positive integer
RBArray(nx,ny) = RBArray(Array{Float64,3}(undef,nx>>>1,2,ny))
#Helper functions to access the RB array with 2 indices: rb[i,j]
@inline Base.getindex(rb::RBArray,i,j) = rb.data[(i+1)>>>1,1+isodd(i+j),j]
@inline Base.setindex!(rb::RBArray,value,i,j) = rb.data[(i+1)>>>1,1+isodd(i+j),j]=value

#RBArray from Array
function RBArray(a::Array{Float64,2}) 
    nx,ny=size(a)
    @assert iseven(nx)
    nxs2=nx>>>1
    rb=RBArray(nx,ny)
    for i in 1:nx,j in 1:ny
        rb[i,j]=a[i,j]
    end
    for j in 1:2:ny 
        #j odd
        for i2 in 0:nxs2-1
            rb.data[i2+1,1,j]=a[2i2+1,j] #i=2i2+1 is odd  => i+j is even (rb.data[:,1,:])
            rb.data[i2+1,2,j]=a[2i2+2,j] #i=2i2+2 is even => i+j is odd (rb.data[:,2,:])
        end
        #j+1 even  
        for i2 in 0:nxs2-1 #i=2i2 is even
            rb.data[i2+1,2,j+1]=a[2i2+1,j+1] #i=2i2+1 is odd  => i+j is even (rb.data[:,2,:])
            rb.data[i2+1,1,j+1]=a[2i2+2,j+1] #i=2i2+2 is even => i+j is odd (rb.data[:,1,:])
        end  
    end
    rb
end

#Array from RBArray
function Array{Float64,2}(rb::RBArray)
    nxs2,two,ny=size(rb.data) 
    nx=2nxs2
    a=Array{Float64,2}(undef,nx,ny)
    #The following block works but is probably slow
    # for i in 1:nx,j in 1:ny
    #     a[i,j]=rb[i,j]
    # end
    for j in 1:2:ny 
        #j odd
        for i2 in 0:nxs2-1 #i=2i2 is even
            a[2i2+1,j]=rb.data[i2+1,1,j]
            a[2i2+2,j]=rb.data[i2+1,2,j]
        end
        #j+1 even  
        for i2 in 0:nxs2-1 #i=2i2 is even
            a[2i2+1,j+1]=rb.data[i2+1,2,j+1]
            a[2i2+2,j+1]=rb.data[i2+1,1,j+1]
        end  
    end
    a
end

#Specialized smooth_seq method for RedBlack arrays.
#Note that the non-specialized smooth_seq method should produce the same result (but more slowly)
@inline function smooth_seq(nrows,ncols,slrb::RBArray,rlrb::RBArray,ih2)
    denom=1/(4ih2)
    nrm1=nrows-1
    sl,rl=slrb.data,rlrb.data

    #evens from odds
    @inbounds for j in 2:2:ncols-1 
        @avx for i2 in 1:nrm1 >>> 1
            sl[i2,1,j]=denom*(rl[i2,1,j]+ih2*(sl[i2,2,j-1]+sl[i2,2,j+1]+sl[i2,2,j]+sl[i2+1,2,j]))
        end
        @avx for i2 in 1:nrm1 >>> 1
            sl[i2+1,1,j+1]=denom*(rl[i2+1,1,j+1]+ih2*(sl[i2+1,2,j]+sl[i2+1,2,j+2]+sl[i2,2,j+1]+sl[i2+1,2,j+1]))
        end
    end
     #odds from evens
    @inbounds for j in 2:2:ncols-1 
        @avx for i2 in 1:nrm1 >>> 1
            sl[i2+1,2,j]=denom*(rl[i2+1,2,j]+ih2*(sl[i2+1,1,j-1]+sl[i2+1,1,j+1]+sl[i2,1,j]+sl[i2+1,1,j]))
        end
        @avx for i2 in 1:nrm1 >>> 1
            sl[i2,2,j+1]=denom*(rl[i2,2,j+1]+ih2*(sl[i2,1,j]+sl[i2,1,j+2]+sl[i2,1,j+1]+sl[i2+1,1,j+1]))
        end
    end
end

#function that calls the benchmarks
function go_rb(n)
    @assert iseven(n)
    # initialize two 2D arrays
    sl=rand(n+2,n+2)
    rl=rand(n+2,n+2)
    # the following block checks that the plain-array and RBArray
    # smooth_seq methods return the same result
    sl_ref=deepcopy(sl)
    slrb=RBArray(sl)
    rlrb=RBArray(rl)
    (nrows,ncols)=size(sl)
    ih2=rand()
    smooth_seq(nrows,ncols,sl,rl,ih2)
    smooth_seq(nrows,ncols,slrb,rlrb,ih2)
    ss=Array{Float64,2}(slrb)
    @show norm(sl-Array{Float64,2}(slrb))

    # #The actual timings
    tsimd=@belapsed smooth_seq($nrows,$ncols,$sl,$rl,$ih2)
    trb=@belapsed smooth_seq($nrows,$ncols,$slrb,$rlrb,$ih2)
    @show tsimd,trb

end

go_rb(128)
go_rb(256)
go_rb(512)

which returns:

julia> include("test/test_perf.jl")
norm(sl - Array{Float64, 2}(slrb)) = 7.88864046354252e-15
(tsimd, trb) = (1.7441e-5, 6.279e-6)
norm(sl - Array{Float64, 2}(slrb)) = 1.530024689710092e-14
(tsimd, trb) = (7.0368e-5, 2.9217e-5)
norm(sl - Array{Float64, 2}(slrb)) = 2.7216263199840184e-14
(tsimd, trb) = (0.000283891, 0.000111511)
(0.000283891, 0.000111511)
chriselrod commented 4 years ago

In the paper's case, the optimal layout was easy to find because it deals specifically with applying the same arbitrarily complex algorithm to N independent problems. So basically, we replace scalars at the lowest level by SIMD packs of scalars, and use multithreading at the highest level on groups of packed problems. The optimal packing factor was usually 2x or 4x the SIMD width.

What determines the optimal packing factor? Dependencies in the computation you're performing?

Ideally, I could give the presentation for you (20 min) and we could then discuss what is possible in Julia. Let me know when you would like to set up a call.

I'm free over the next few days. I live in the US, so later is preferable. Starting next Monday, I'd be free 13:00-15:00 French time (before work).

StructArrays is a library that can often make SPMD-style programming/vectorization easier.
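
For example (a minimal sketch of the struct-of-arrays idea, not code from this issue):

using StructArrays

# A StructArray stores each field of the element type in its own contiguous
# array (struct-of-arrays), which is the layout SIMD loops want.
xs = StructArray([1.0 + 2.0im, 3.0 + 4.0im, 5.0 + 6.0im])
xs.re   # [1.0, 3.0, 5.0]: real parts, contiguous
xs.im   # [2.0, 4.0, 6.0]: imaginary parts, contiguous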

chriselrod commented 3 years ago

Fixed on LoopVectorization 0.12. Rerunning these benchmarks:

julia> @show go(128)
norm(sl - sl_ref) = 0.0
(tsimd, tavx) = (2.1866e-5, 1.9846e-5)
go(128) = (2.1866e-5, 1.9846e-5)
(2.1866e-5, 1.9846e-5)

julia> @show go(256)
norm(sl - sl_ref) = 0.0
(tsimd, tavx) = (8.5776e-5, 8.1598e-5)
go(256) = (8.5776e-5, 8.1598e-5)
(8.5776e-5, 8.1598e-5)

Now defining smooth_line_avx to use 1:2:nrm1 directly, like the @simd loop (which previously did not work):

julia> #Original functions (smooth_line and smooth_seq using @simd) # now with `@avx` and `1:2:nrm1`
       @inline function smooth_line_avx(nrm1,j,i1,sl,rl,ih2,denom)
           @avx for i=i1:2:nrm1
                   sl[i,j]=denom*(rl[i,j]+ih2*(sl[i,j-1]+sl[i-1,j]+sl[i+1,j]+sl[i,j+1]))
           end
       end
smooth_line_avx (generic function with 1 method)

julia> @show go(128)
norm(sl - sl_ref) = 0.0
(tsimd, tavx) = (2.2535e-5, 2.0019e-5)
go(128) = (2.2535e-5, 2.0019e-5)
(2.2535e-5, 2.0019e-5)

julia> @show go(256)
norm(sl - sl_ref) = 0.0
(tsimd, tavx) = (8.3112e-5, 8.0237e-5)
go(256) = (8.3112e-5, 8.0237e-5)
(8.3112e-5, 8.0237e-5)