Closed danielwe closed 2 years ago
https://github.com/JuliaSIMD/StrideArraysCore.jl/commit/d9c13e936ce14d46e2c3f5620691b14c989c572b
julia> using BenchmarkTools, FastBroadcast, Polyester
julia> tanh_fastbroadcast!(x, y) = (@.. thread=true x = tanh(y))
tanh_fastbroadcast! (generic function with 1 method)
julia> function tanh_batch!(x, y)
           @batch for i in eachindex(x, y)
               x[i] = tanh(y[i])
           end
       end
tanh_batch! (generic function with 1 method)
julia> N = 32; x = zeros(N);
julia> @btime tanh_fastbroadcast!($x, y) setup=(y = randn(N));
415.635 ns (0 allocations: 0 bytes)
julia> @btime tanh_batch!($x, y) setup=(y = randn(N));
431.638 ns (0 allocations: 0 bytes)
Using `@.. thread=true` produces an allocation. It looks like a variable ends up being boxed, or something like that. Writing out the equivalent loop and using `@batch` from Polyester avoids the allocation. (In fairness, the overhead is usually not devastating; in the timings below, the allocating version actually won out due to laptop CPU throttling.)

MWE: