kamesy closed this issue 1 year ago.
If `@batch` is giving worse performance than `Threads.@threads`, you could try `@batch per=thread for`. Also, `@simd` would help on the innermost loop.
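A minimal sketch of combining the two suggestions (assumes Polyester is installed; the function name `colmean_per_thread!` is made up for illustration). `per=thread` tells `@batch` to create one batch per Julia thread rather than per physical core:

```julia
using Polyester

# Row means of x, with `per=thread` batching and `@simd` on the inner loop.
function colmean_per_thread!(y::AbstractVector, x::AbstractMatrix)
    a = 1 / size(x, 2)
    @batch per=thread for i in eachindex(y)
        s = zero(eltype(x))
        @simd for j in axes(x, 2)
            s += x[i, j]
        end
        y[i] = a * s
    end
    return y
end
```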
For this particular example,

```julia
using LoopVectorization

function mean2_tturbo!(y::AbstractVector, x::AbstractMatrix)
    m = size(x, 2)
    a = 1 / m
    @tturbo for i in eachindex(y)
        x̄ = zero(eltype(x))
        for j in 1:m
            x̄ += x[i, j]
        end
        y[i] = a * x̄
    end
    return y
end
```

should be much faster. Note that

```julia
for i in eachindex(y)
    x̄ = zero(eltype(x))
    for j in 1:m
        x̄ += x[i, j]
    end
end
```

should generally be faster than

```julia
for i in eachindex(y)
    x̄ = x[i, 1]
    for j in 2:m
        x̄ += x[i, j]
    end
end
```

when using `@simd`, `@fastmath`, or `@turbo`.
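The difference between the two patterns can be isolated on a single vector (a minimal sketch in base Julia; the function names are made up for illustration). Initializing the accumulator with `zero(eltype(x))` gives the compiler a single uniform reduction over the full range, whereas peeling off the first element leaves a scalar prologue before the vectorizable part:

```julia
# Pattern 1: accumulator starts at zero, loop covers the whole range.
function sum_zero_init(v::AbstractVector)
    s = zero(eltype(v))
    @simd for j in eachindex(v)
        s += v[j]
    end
    return s
end

# Pattern 2: accumulator starts at v[1], loop covers 2:length(v).
function sum_peeled(v::AbstractVector)
    s = v[1]
    @simd for j in 2:length(v)
        s += v[j]
    end
    return s
end
```

Both compute the same sum (up to floating-point reassociation), so the choice is purely about how well the compiler can vectorize the reduction.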
I can reproduce `view` being slow with `--check-bounds=no`. In fact, it hasn't terminated for me yet.

EDIT: I just got a segfault, so my current guess is that it is slow because it is wrong and working with junk data.
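Given the junk-data suspicion, it is worth checking each variant's output against a trusted serial reference before timing anything. A sketch using the `Statistics` standard library (the name `mean2_ref!` is made up for illustration):

```julia
using Statistics

# Serial reference: y[i] is the mean of row i of x.
function mean2_ref!(y::AbstractVector, x::AbstractMatrix)
    for i in eachindex(y)
        y[i] = mean(@view x[i, :])
    end
    return y
end

x = randn(100, 8)
y = similar(x, 100)
mean2_ref!(y, x)

# Any of the threaded kernels should reproduce this up to rounding.
@assert y ≈ vec(mean(x, dims = 2))
```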
For convenience, a script to time them all (note that `BenchmarkTools` is needed for `@benchmark`):

```julia
using Polyester, LoopVectorization, BenchmarkTools

function mean2_batch!(y::AbstractVector, x::AbstractMatrix)
    m = size(x, 2)
    a = 1 / m
    @batch for i in eachindex(y)
        x̄ = x[i, 1]
        for j in 2:m
            x̄ += x[i, j]
        end
        y[i] = a * x̄
    end
    return y
end

function mean2_batch_view!(y::AbstractVector, x::AbstractMatrix)
    m = size(x, 2)
    a = 1 / m
    @batch for i in eachindex(y)
        v = view(x, i, :)
        x̄ = v[1]
        for j in 2:m
            x̄ += v[j]
        end
        y[i] = a * x̄
    end
    return y
end

function mean2_threads!(y::AbstractVector, x::AbstractMatrix)
    m = size(x, 2)
    a = 1 / m
    Threads.@threads for i in eachindex(y)
        x̄ = x[i, 1]
        for j in 2:m
            x̄ += x[i, j]
        end
        y[i] = a * x̄
    end
    return y
end

function mean2_threads_view!(y::AbstractVector, x::AbstractMatrix)
    m = size(x, 2)
    a = 1 / m
    Threads.@threads for i in eachindex(y)
        v = view(x, i, :)
        x̄ = v[1]
        for j in 2:m
            x̄ += v[j]
        end
        y[i] = a * x̄
    end
    return y
end

function mean2_tturbo!(y::AbstractVector, x::AbstractMatrix)
    m = size(x, 2)
    a = 1 / m
    @tturbo for i in eachindex(y)
        x̄ = zero(eltype(x))
        for j in 1:m
            x̄ += x[i, j]
        end
        y[i] = a * x̄
    end
    return y
end

n, m = (300^3, 48);
x = randn(n, m);
y = randn(n);

@time mean2_batch!(y, x);
@time mean2_batch_view!(y, x);
@time mean2_threads!(y, x);
@time mean2_threads_view!(y, x);
@time mean2_tturbo!(y, x);

@time mean2_batch!(y, x);
@time mean2_batch!(y, x);
@time mean2_batch_view!(y, x);
@time mean2_batch_view!(y, x);
@time mean2_threads!(y, x);
@time mean2_threads!(y, x);
@time mean2_threads_view!(y, x);
@time mean2_threads_view!(y, x);
@time mean2_tturbo!(y, x);
@time mean2_tturbo!(y, x);

@benchmark mean2_batch!($y, $x)
@benchmark mean2_batch_view!($y, $x)
@benchmark mean2_threads!($y, $x)
@benchmark mean2_threads_view!($y, $x)
@benchmark mean2_tturbo!($y, $x)
```
FWIW, using `--check-bounds=no` also gives me really bad performance for `mean2_batch!`.

Anyway, just using `Cthulhu.@descend` shows the problem is that type inference fails with `--check-bounds=no`. I'd say it's a base Julia issue, not a Polyester issue.
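For anyone without Cthulhu, the inference problem can also be spotted with `@code_warntype` by comparing a default session against one started with `--check-bounds=no` (a sketch; `rowsum_view` is a made-up minimal kernel, not the exact one from the issue):

```julia
using InteractiveUtils  # provides @code_warntype

# Minimal view-based row kernel, inspected for inference quality.
# Run this file twice and compare the printed output:
#   julia -t4 repro.jl
#   julia -t4 --check-bounds=no repro.jl
# `Any`-typed variables in the second run indicate the inference failure.
rowsum_view(x, i) = sum(view(x, i, :))

x = randn(16, 4)
@code_warntype rowsum_view(x, 1)
```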
Closing in favor of: https://github.com/JuliaLang/julia/issues/49472
Awesome. Thanks!
On Julia 1.8 there seems to be an odd performance regression/issue when the session is started with `--check-bounds=no` and the `@batch` loop contains views. The two loops with `@threads`: