JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.06k stars 5.43k forks source link

higher-order functions often slower than equivalent broadcast #34477

Open baggepinnen opened 4 years ago

baggepinnen commented 4 years ago

I could not find an issue tracking this. I often find that higher-order functions fail to vectorize where the equivalent broadcast statement vectorizes well and is much faster. An example

x = randn(Float32, 100_000)
@btime map(sign, $x) # 106.942 μs (3 allocations: 390.72 KiB)
@btime sign.($x)     # 37.170 μs (2 allocations: 390.70 KiB)

another

x = rand(Float32, 1_000_000)
xc = copy(x)
FV = -999.0f0

x[x .> 0.999f0] .= FV
@btime findall(!isequal($FV), $x); # 9.001 ms (24 allocations: 9.00 MiB)
@btime findall($x .≠ $FV);         # 1.687 ms (6 allocations: 7.75 MiB)
x .= xc
x[x .> 0.001f0] .= FV
@btime findall(!isequal($FV), $x); # 1.343 ms (14 allocations: 16.55 KiB)
@btime findall($x .≠ $FV);         # 376.731 μs (5 allocations: 134.36 KiB)

The map example above could potentially be mitigated by implementing map in terms of broadcast for some set of mapped functions known to vectorize. The second example would be harder to specialize, and would require findall to be able to specialize on its own. (all timings above are on julia v"1.5.0-DEV.130")

timholy commented 4 years ago

Is it just the presence/absence of @simd that explains this?

baggepinnen commented 4 years ago

In the first case it very well can be. In the second case I have a theory that the difference comes from findall(f,...) having to reallocate a larger array every time the allocated space fills up, whereas the findall(x .!= ...) always allocates a bitarray, but when this bitarray is known, the size of the final, larger output array can be calculated and allocated once. Interestingly, the situation in which the finall(f approach saves very few allocations, whereas the other extreme it allocates quite a bit more. The bitarray approach is thus faster in both extreme cases and allocates less memory overall