JuliaLang / julia


Performance regression in broadcasting functions? #46365

Open · BenjaminRemez opened this issue 2 years ago

BenjaminRemez commented 2 years ago

While investigating whether this performance difference changes in 1.8, I found the following regression. For example:

using BenchmarkTools, Random
function func!(v)
    @. v = log(rand())
end

v = rand(10000);
@btime func!(v);

Under 1.7.3:

$ julia +release script.jl
  135.600 μs (0 allocations: 0 bytes)

Under 1.8-rc4:

$ julia +1.8 script.jl
  199.900 μs (0 allocations: 0 bytes)

I should note that with the internal function changed to exp, i.e. @. v = exp(rand()), there is a dramatic performance boost in 1.8-rc4. This appears to be because exp is inlined in 1.8 while it was not in 1.7 - though I understand that may be a bug rather than a feature (#46323, #46359).
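
For reference, here is a minimal sketch of the exp variant described above; the name func_exp! is mine, and timings will of course vary by hardware and Julia version.

using BenchmarkTools

function func_exp!(v)
    @. v = exp(rand())   # same pattern as func!, with exp instead of log
end

@btime func_exp!($v);    # reuses v = rand(10000) from the snippet above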

N5N3 commented 2 years ago

I can confirm this on my PC.

BTW, this version is consistently (about 3x) faster than func! on 1.8/master:

function func2!(v)
    Random.seed!(0)
    @. v = rand()
    @. v = log.(v)
end
@btime func!($v) #178.200 μs (6 allocations: 464 bytes)    | 1.7: 122.000 μs (6 allocations: 464 bytes)
@btime func2!($v) # 66.600 μs (6 allocations: 464 bytes)   | 1.7: 70.600 μs (6 allocations: 464 bytes)
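
A related sketch, mine rather than anything benchmarked in this thread: fill the buffer in place with Random.rand! and take the log in a second pass. The name func3! is hypothetical.

using Random

function func3!(v)
    rand!(v)          # in-place uniform fill from the Random stdlib
    @. v = log(v)     # elementwise log in a second pass
end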

gbaraldi commented 2 years ago

@oscardssmith

BenjaminRemez commented 2 years ago

@N5N3 Yes, that's the performance difference I already spotted in this Discord post. (I can open a separate issue for that disparity as well.)

vtjnash commented 2 years ago

@oscardssmith is this fixed by https://github.com/JuliaLang/julia/pull/46359?

oscardssmith commented 2 years ago

I doubt it since this code never calls exp.

adienes commented 9 months ago

down to

julia> @btime func!(v);
  57.542 μs (0 allocations: 0 bytes)

on master. Regression fixed!

BenjaminRemez commented 9 months ago

@adienes @vtjnash Are you sure this benchmark does not just compare favorably because of your test hardware? Running on today's nightly build, I get:

julia> versioninfo()
Julia Version 1.11.0-DEV.1348
Commit a9e2bd4713 (2024-01-21 11:16 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, icelake-client)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

julia> @btime func!(v);
  201.100 μs (0 allocations: 0 bytes)

which is comparable to the performance on 1.8.3. My original post was made on this same hardware, and I can reproduce that timing:

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, icelake-client)
  Threads: 1 on 8 virtual cores

julia> @btime func!(v);
  200.400 μs (0 allocations: 0 bytes)

(In fact, using @benchmark suggests 1.8 has a consistently lower minimum time.)
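
For anyone repeating this comparison, a minimal sketch of the @benchmark approach mentioned above, using standard BenchmarkTools/Statistics calls rather than anything specific to this issue:

using BenchmarkTools, Statistics

b = @benchmark func!($v)
minimum(b)   # minimum time is usually the most stable point of comparison
median(b)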

adienes commented 9 months ago

I was just scrolling through older regressions to see if any had been fixed. If I was too hasty and you can still reproduce the problem, I think it would be appropriate to re-open this issue.

However, in that case I would add the windows tag (I cannot add it myself) or change the title to reflect that it may be platform-dependent.

BenjaminRemez commented 9 months ago

I cannot add tags or re-open the issue (I can change the title, though). I don't have access to a non-Windows machine at the moment - could you confirm whether the original regression does not appear on another platform at all, or whether it is fixed on master?

adienes commented 9 months ago

My original comment was made after testing on my hardware, which is:

Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores

I had only tried 1.7 vs. master before posting my first comment; the result over version history is shown in the benchmark plot attached here. Interestingly, not only does the original regression not occur on this machine, but there also seems to be a separate regression from 1.10 to master.
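
For completeness, one way to reproduce this kind of version-history comparison locally is to run the same script under several juliaup channels, as already done earlier in the thread; the channel names below are examples and assume those channels are installed.

# Run the reproduction script under several Julia versions via juliaup.
# Adjust the channel list to whatever is installed locally.
for channel in ("1.7", "1.8", "1.10", "nightly")
    println("== julia +", channel, " ==")
    run(`julia +$channel script.jl`)
end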