Open chrstphrbrns opened 1 year ago
It looks like there's also quite a bit of performance left on the table in both cases.
julia> @btime sum(sin.($a))
6.207 ms (2 allocations: 7.63 MiB)
-0.11710952409819408
julia> @btime sum(sin,$a)
5.925 ms (0 allocations: 0 bytes)
-0.11710952409817987
julia> using LoopVectorization
julia> @btime vmapreduce(sin, +, $a)
1.501 ms (0 allocations: 0 bytes)
-0.11710952409810094
(compared to LoopVectorization.jl)
sum(f,A) performs significantly worse than sum(f.(A)) for integer inputs to certain transcendental functions on x86 (maybe specific to AMD?)
Different story on Apple Silicon
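
The session above does not show how `a` was constructed. The 7.63 MiB allocation reported for `sum(sin.($a))` matches a temporary `Float64` vector of 10^6 elements, and the title mentions integer inputs, so a rough reproduction sketch under those assumptions might look like the following. The value range `1:10^6` and the `Float64` comparison at the end are illustrative choices, not part of the original report.

```julia
using BenchmarkTools

# Assumption: 10^6 Int64 values; the value range 1:10^6 is arbitrary.
a = rand(1:10^6, 10^6)

# Broadcast first: materializes sin.(a) as a temporary Vector{Float64}
# (8 * 10^6 bytes, about 7.63 MiB), then sums it.
@btime sum(sin.($a))

# Mapped reduction: applies sin to each element inside the reduction,
# so no temporary array is allocated.
@btime sum(sin, $a)

# Hypothetical comparison: convert once up front so that sin receives
# Float64 inputs in the mapped reduction.
af = Float64.(a)
@btime sum(sin, $af)
```

Running the three benchmarks on both an x86 machine and an Apple Silicon machine would be one way to check whether the gap depends on the integer element type, on the architecture, or on both.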