fkastner opened this issue 3 years ago
Could you profile it? It should of course be the same `mul!` (and thus `dgemm!`) each time, and I'd think that is where it is spending most of the time.
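A minimal profiling recipe could look like the following. This is a sketch: the array shapes are placeholders for illustration, not the ones from this issue's `test()`.

```julia
using Profile, LinearAlgebra

# Placeholder shapes, not the ones from the issue.
A = rand(1000, 100)
B = rand(100, 1000)
C = zeros(1000, 1000)

Profile.clear()
@profile for _ in 1:1000
    mul!(C, A, B)   # in-place product; should bottom out in BLAS gemm!
end
Profile.print(mincount = 10)  # check what share of samples lands in gemm!
```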
Why should `bar` be slower than `foo`? I think the `baz` here is a red herring.
I cannot reproduce this on 1.6.0-rc1 (locally compiled)
julia> test(10,1000,100)
180.400 μs (5 allocations: 867.47 KiB)
117.700 μs (0 allocations: 0 bytes)
115.300 μs (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.6.0-rc1
Commit a58bdd9010* (2021-02-06 15:49 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
JULIA_PKG_SERVER =
JULIA_NUM_THREADS = 4
julia> BLAS._get_num_threads()
4
Seems to be AMD specific.
> Why should `bar` be slower than `foo`? I think the `baz` here is a red herring.
Yes, that doesn't make sense to me.
I thought profiling could confirm how much time `dgemm!` in particular is taking in each.
The preallocated version should also benefit from memory being hot in the cache. Maybe the hardware prefetcher throws things off.
Given that @fkastner is on Linux, using LinuxPerf to look at cache misses would also be helpful.
@profile for _=1:1000; foo(m, n, k); end;
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
╎154 @Base/client.jl:506; _start()
╎ 154 @Base/client.jl:296; exec_options(::Base.JLOptions)
╎ 154 @Base/Base.jl:368; include(::Module, ::String)
╎ 154 @Base/Base.jl:380; include(::Function, ::Module, ::String)
╎ 154 ...linux64/build/usr/share/julia/stdlib/v1.5/Profile/src/Profile.jl:28; top-level scope
╎ 154 /home/kastner/Dokumente/Julia/playground/testAllocation.jl:66; macro expansion
╎ ╎ 75 /home/kastner/Dokumente/Julia/playground/testAllocation.jl:9; foo(::Int64, ::Int64, ::Int64)
╎ ╎ 75 @Base/broadcast.jl:845; materialize!
╎ ╎ 75 @Base/broadcast.jl:848; materialize!
╎ ╎ 75 @Base/broadcast.jl:894; copyto!
╎ ╎ 75 @Base/array.jl:357; fill!
74╎ ╎ ╎ 75 @Base/array.jl:847; setindex!
╎ ╎ 49 /home/kastner/Dokumente/Julia/playground/testAllocation.jl:10; foo(::Int64, ::Int64, ::Int64)
╎ ╎ 49 .../build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:208; mul!
╎ ╎ 49 .../build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:169; mul!
╎ ╎ 49 ...build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:597; gemm_wrapper!(::Array{Float64,2}, ::Char, ::Char, ::Array{Float64,2}, ::Array{Float64,2}, ::Li...
49╎ ╎ 49 ...4/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/blas.jl:1374; gemm!(::Char, ::Char, ::Float64, ::Array{Float64,2}, ::Array{Float64,2}, ::Float64, ::Array{F...
Total snapshots: 172
and
@profile for _=1:1000; bar(A, B, C); end;
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
╎200 @Base/client.jl:506; _start()
╎ 200 @Base/client.jl:296; exec_options(::Base.JLOptions)
╎ 200 @Base/Base.jl:368; include(::Module, ::String)
╎ 200 @Base/Base.jl:380; include(::Function, ::Module, ::String)
╎ 200 ...inux64/build/usr/share/julia/stdlib/v1.5/Profile/src/Profile.jl:28; top-level scope
╎ 200 /home/kastner/Dokumente/Julia/playground/testAllocation.jl:70; macro expansion
╎ ╎ 30 /home/kastner/Dokumente/Julia/playground/testAllocation.jl:15; bar(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,2})
╎ ╎ 30 @Base/broadcast.jl:845; materialize!
╎ ╎ 30 @Base/broadcast.jl:848; materialize!
╎ ╎ 30 @Base/broadcast.jl:894; copyto!
╎ ╎ 30 @Base/array.jl:357; fill!
30╎ ╎ ╎ 30 @Base/array.jl:847; setindex!
╎ ╎ 133 /home/kastner/Dokumente/Julia/playground/testAllocation.jl:16; bar(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,2})
╎ ╎ 133 @Base/broadcast.jl:845; materialize!
╎ ╎ 133 @Base/broadcast.jl:848; materialize!
╎ ╎ 133 @Base/broadcast.jl:894; copyto!
╎ ╎ 133 @Base/array.jl:357; fill!
132╎ ╎ ╎ 133 @Base/array.jl:847; setindex!
╎ ╎ 37 /home/kastner/Dokumente/Julia/playground/testAllocation.jl:17; bar(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,2})
╎ ╎ 37 .../build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:208; mul!
╎ ╎ 37 ...build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:169; mul!
╎ ╎ 37 ...build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:597; gemm_wrapper!(::Array{Float64,2}, ::Char, ::Char, ::Array{Float64,2}, ::Array{Float64,2}, ::L...
37╎ ╎ 37 .../build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/blas.jl:1374; gemm!(::Char, ::Char, ::Float64, ::Array{Float64,2}, ::Array{Float64,2}, ::Float64, ::Array{...
Total snapshots: 227
julia> @measure foo(m,n,k)
┌───────────────────────┬─────────────┬─────────────┐
│ │ Events │ Active Time │
├───────────────────────┼─────────────┼─────────────┤
│ hw:cycles │ 329,469 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ hw:cache_access │ 0 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ hw:cache_misses │ 0 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ hw:branches │ did not run │ 0 % │
├───────────────────────┼─────────────┼─────────────┤
│ hw:branch_mispredicts │ did not run │ 0 % │
├───────────────────────┼─────────────┼─────────────┤
│ hw:instructions │ did not run │ 0 % │
├───────────────────────┼─────────────┼─────────────┤
│ sw:ctx_switches │ 0 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ sw:page_faults │ 1 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ sw:minor_page_faults │ 1 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ sw:major_page_faults │ 0 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ sw:cpu_migrations │ 0 │ 100.0 % │
└───────────────────────┴─────────────┴─────────────┘
julia> @measure bar(A,B,C)
┌───────────────────────┬─────────────┬─────────────┐
│ │ Events │ Active Time │
├───────────────────────┼─────────────┼─────────────┤
│ hw:cycles │ 374,790 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ hw:cache_access │ 0 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ hw:cache_misses │ 0 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ hw:branches │ did not run │ 0 % │
├───────────────────────┼─────────────┼─────────────┤
│ hw:branch_mispredicts │ did not run │ 0 % │
├───────────────────────┼─────────────┼─────────────┤
│ hw:instructions │ did not run │ 0 % │
├───────────────────────┼─────────────┼─────────────┤
│ sw:ctx_switches │ 0 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ sw:page_faults │ 0 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ sw:minor_page_faults │ 0 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ sw:major_page_faults │ 0 │ 100.0 % │
├───────────────────────┼─────────────┼─────────────┤
│ sw:cpu_migrations │ 0 │ 100.0 % │
└───────────────────────┴─────────────┴─────────────┘
I omitted `baz` for now. If you want, I can profile it too.
So assuming the number of snapshots is proportional to time and not too noisy, we have:

Snapshots inside testAllocation.jl: foo: 154, bar: 200

Within `mul!`: foo: 49, bar: 37

So `bar`'s `mul!` seems faster, perhaps because the reused matrix `C` is fresh in cache?

broadcast: foo: 75, bar: 163

Shouldn't that be two separate calls to broadcast in `foo`?
For fast functions, I would do
foreachf(f::F, N, args::Vararg{Any,A}) where {F,A} = foreach(_ -> f(args...), 1:N)
@measure foreachf(foo, 10, m, n, k) # compile
@measure foreachf(bar, 10, C, A, B) # compile
@measure foreachf(foo, 10_000, m, n, k)
@measure foreachf(bar, 10_000, C, A, B)
I also prefer the `@pstats` macro over `@measure`, but that requires the master branch.
julia> @measure foreachf(foo, 10_000, m, n, k)
┌───────────────────────┬───────────────┬─────────────┐
│ │ Events │ Active Time │
├───────────────────────┼───────────────┼─────────────┤
│ hw:cycles │ 4,599,882,778 │ 93.2 % │
├───────────────────────┼───────────────┼─────────────┤
│ hw:cache_access │ 0 │ 93.5 % │
├───────────────────────┼───────────────┼─────────────┤
│ hw:cache_misses │ 0 │ 93.5 % │
├───────────────────────┼───────────────┼─────────────┤
│ hw:branches │ 25,269,366 │ 13.3 % │
├───────────────────────┼───────────────┼─────────────┤
│ hw:branch_mispredicts │ 482,065 │ 13.3 % │
├───────────────────────┼───────────────┼─────────────┤
│ hw:instructions │ 391,034,223 │ 13.3 % │
├───────────────────────┼───────────────┼─────────────┤
│ sw:ctx_switches │ 0 │ 100.0 % │
├───────────────────────┼───────────────┼─────────────┤
│ sw:page_faults │ 49,892 │ 100.0 % │
├───────────────────────┼───────────────┼─────────────┤
│ sw:minor_page_faults │ 49,892 │ 100.0 % │
├───────────────────────┼───────────────┼─────────────┤
│ sw:major_page_faults │ 0 │ 100.0 % │
├───────────────────────┼───────────────┼─────────────┤
│ sw:cpu_migrations │ 0 │ 100.0 % │
└───────────────────────┴───────────────┴─────────────┘
and
julia> @measure foreachf(bar, 10_000, A, B, C)
┌───────────────────────┬───────────────┬─────────────┐
│ │ Events │ Active Time │
├───────────────────────┼───────────────┼─────────────┤
│ hw:cycles │ 7,286,116,993 │ 94.5 % │
├───────────────────────┼───────────────┼─────────────┤
│ hw:cache_access │ 0 │ 94.3 % │
├───────────────────────┼───────────────┼─────────────┤
│ hw:cache_misses │ 0 │ 94.3 % │
├───────────────────────┼───────────────┼─────────────┤
│ hw:branches │ 13,476,961 │ 11.2 % │
├───────────────────────┼───────────────┼─────────────┤
│ hw:branch_mispredicts │ 103,112 │ 11.2 % │
├───────────────────────┼───────────────┼─────────────┤
│ hw:instructions │ 287,561,123 │ 11.2 % │
├───────────────────────┼───────────────┼─────────────┤
│ sw:ctx_switches │ 0 │ 100.0 % │
├───────────────────────┼───────────────┼─────────────┤
│ sw:page_faults │ 0 │ 100.0 % │
├───────────────────────┼───────────────┼─────────────┤
│ sw:minor_page_faults │ 0 │ 100.0 % │
├───────────────────────┼───────────────┼─────────────┤
│ sw:major_page_faults │ 0 │ 100.0 % │
├───────────────────────┼───────────────┼─────────────┤
│ sw:cpu_migrations │ 0 │ 100.0 % │
└───────────────────────┴───────────────┴─────────────┘
Ok, I suppose `page_faults` and `branch_mispredicts` are bad and that's why `bar` is faster than... wait. 🤔
What happens if you remove the `mul!` and just benchmark the broadcasting?
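A broadcast-only comparison could be sketched like this (sizes and the use of `zeros`/`.=` are assumptions for illustration, not the code from the issue):

```julia
using BenchmarkTools

n = 1000
C = zeros(n, n)

# Allocating variant: build and fill a fresh matrix every call (like foo).
@btime zeros($n, $n);

# Preallocated variant: overwrite the reused buffer via broadcast (like bar).
@btime ($C .= 0.0);
```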
Fredrik Ekre suggested that before. It seems more reasonable:
julia> test(10,1000,100)
54.893 μs (5 allocations: 867.47 KiB)
14.907 μs (0 allocations: 0 bytes)
20.098 μs (0 allocations: 0 bytes)
Just an update. I came back to this and wondered whether it got better on newer versions. Turns out, it got worse. (Using `test(10, 1000, 100)` from the first post.)
| julia | foo | bar |
| --- | --- | --- |
| v1.5.4 | 112.360 μs (5 allocations: 867.47 KiB) | 197.449 μs (0 allocations: 0 bytes) |
| v1.6.5 | 113.291 μs (5 allocations: 867.47 KiB) | 196.426 μs (0 allocations: 0 bytes) |
| v1.7.2 | 194.573 μs (5 allocations: 867.41 KiB) | 223.136 μs (0 allocations: 0 bytes) |
| v1.8.0-DEV.1505 | 278.971 μs (5 allocations: 867.41 KiB) | 353.469 μs (0 allocations: 0 bytes) |
Should I file this as a separate regression?
Test system is still the same:
julia> versioninfo()
Julia Version 1.8.0-DEV.1505
Commit e422590151 (2022-02-12 17:07 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 5 2600 Six-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.0 (ORCJIT, znver1)
Hi, while optimizing a function I found a case where reducing allocations (by providing a cache struct) actually made the function slower. The following minimal example:
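The embedded code block did not survive here. Based on the call signatures and the `testAllocation.jl` line numbers in the profile traces later in the thread, the example was shaped roughly like the following. This is a hedged reconstruction with assumed bodies and shapes, not the original code (the original `test` also timed a third variant, `baz`):

```julia
using LinearAlgebra, BenchmarkTools

# foo allocates its work arrays on every call (bodies are assumptions).
function foo(m, n, k)
    A = rand(n, k)
    B = rand(k, n)
    C = Matrix{Float64}(undef, n, n)
    C .= 0.0          # broadcast fill (testAllocation.jl:9 in the profile)
    mul!(C, A, B)     # testAllocation.jl:10
    return C
end

# bar reuses preallocated arrays (bodies are assumptions).
function bar(A, B, C)
    A .= 1.0          # testAllocation.jl:15
    B .= 1.0          # testAllocation.jl:16
    mul!(C, A, B)     # testAllocation.jl:17
    return C
end

function test(m, n, k)
    A = rand(n, k); B = rand(k, n); C = zeros(n, n)
    @btime foo($m, $n, $k)
    @btime bar($A, $B, $C)
end
```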
produces this output on my work computer:
On my laptop I get much more intuitive results:
@MasonProtter could reproduce similar timings on his AMD machine. @giordano suggested it may be a code-generation/compiler bug, so I opened this issue.
Edit: Link to the Zulip discussion.