@tturbo not multithreading nested loops

jgslazzaro commented 2 years ago

I am currently running Julia with Threads.nthreads()=10, however @tturbo does not use more than one thread in the code below. The code gets a nice speed up using @turbo, however it is still faster to use Threads.@threads, I was hoping to get the best of both worlds with @tturbo.

using LoopVectorization,Interpolations

NK = 150
NB = 200
nA = 100
nZ = 30

BM = range(-5.0,stop=5,length = NB);
KM = range(0.0,stop=10,length = NK);
Z = range(-1,stop = 1,length= nZ);
A = range(0.0,stop=10,length = nA);

pdfZ = rand(nZ,nZ) ;

AK1 = rand(1,nA,NK);
DD = zeros(nZ,nA,NK,NB);
EE = zeros(nZ,nA,NK,NB);
O = zeros(nZ,nA,NK,NB);

π1 = rand(nZ,NK,NB) .+1;
Q = rand(nZ,NK,NB).+1;
W = rand(nZ,nA) .+1;

qinterp = extrapolate(Interpolations.scale(interpolate(Q, BSpline(Linear())), Z, KM,BM), Interpolations.Line());
πinterp = extrapolate(Interpolations.scale(interpolate(π1, BSpline(Linear())), Z, KM,BM), Interpolations.Line());
winterp = extrapolate(Interpolations.scale(interpolate(W, BSpline(Linear())), Z, A), Interpolations.Line());

@tturbo for ib1 ∈ eachindex(BM) , ik1 ∈ eachindex(KM),ia ∈ eachindex(A),iz ∈ eachindex(Z)   

    DD[iz,ia,ik1,ib1] = AK1[1,ia,ik1] +  qinterp(Z[iz],KM[ik1],BM[ib1]) * BM[ib1] 
    for iz1 ∈ eachindex(Z)
        EE[iz,ia,ik1,ib1]  += pdfZ[iz, iz1] * winterp(Z[iz1], πinterp(Z[iz1],KM[ik1],BM[ib1]))
    end
        O[iz,ia,ik1,ib1] = -1/max(DD[iz,ia,ik1,ib1], eps()) +(1.0 / (1.0 + 0.1)) * EE[iz,ia,ik1,ib1] 
end

chriselrod commented 2 years ago

Mind giving me an example I can copy/paste to run?

jgslazzaro commented 2 years ago

I updated the post above with a working example. Thanks!

JuliaSIMD / LoopVectorization.jl

@tturbo not multithreading nested loops #362