JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Allocating code faster than non-allocating #39566

Open fkastner opened 3 years ago

fkastner commented 3 years ago

Hi, while optimizing a function I found a case where reducing allocations (by providing a cache struct) actually made the function slower. The following minimal example:

using BenchmarkTools
using LinearAlgebra

function foo(m,n,k)
    A = Matrix{Float64}(undef, m, n)
    B = Matrix{Float64}(undef, n, k)
    C = Matrix{Float64}(undef, m, k)
    A .= 1.0
    B .= 1.0
    mul!(C, A, B)
    return C
end

function bar(A,B,C)
    A .= 1.0
    B .= 1.0
    mul!(C, A, B)
    return C
end

struct MyCache
    A::Array{Float64, 2}
    B::Array{Float64, 2}
    C::Array{Float64, 2}
    function MyCache(m,n,k)
        A = zeros(m,n)
        B = zeros(n,k)
        C = zeros(m,k)
        return new(A,B,C)
    end
end

function baz(cache::MyCache)
    cache.A .= 1.0
    cache.B .= 1.0
    mul!(cache.C, cache.A, cache.B)
    return cache.C
end

function test(m, n, k)
    A = zeros(m,n)
    B = zeros(n,k)
    C = zeros(m,k)
    mycache = MyCache(m, n, k)

    @btime foo($m, $n, $k)
    @btime bar($A, $B, $C)
    @btime baz($mycache)
    nothing
end

produces this output on my work computer:

julia> test(10, 1000, 100)
  94.676 μs (5 allocations: 867.47 KiB)
  143.076 μs (0 allocations: 0 bytes)
  126.666 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)

On my laptop I get much more intuitive results:

julia> test(10, 1000, 100)
  210.219 μs (5 allocations: 867.47 KiB)
  139.470 μs (0 allocations: 0 bytes)
  144.290 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-4210M CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)

@MasonProtter could reproduce similar timings on his AMD machine. @giordano suggested it may be a code-generation/compiler bug, so I opened this issue.

Edit: Link to the Zulip discussion.

chriselrod commented 3 years ago

Could you profile it? It should of course be the same mul! (and thus dgemm!) each time, and I'd think that is where it is spending most of the time.
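
A minimal sketch of how such a profile could be collected with the `Profile` stdlib (the `foo` definition is repeated from the first post; `mincount` is just a filter to keep the report readable):

```julia
using Profile, LinearAlgebra

function foo(m, n, k)
    A = Matrix{Float64}(undef, m, n)
    B = Matrix{Float64}(undef, n, k)
    C = Matrix{Float64}(undef, m, k)
    A .= 1.0
    B .= 1.0
    mul!(C, A, B)
    return C
end

m, n, k = 10, 1000, 100
foo(m, n, k)                   # compile first, so we profile runtime, not compilation
Profile.clear()
@profile for _ in 1:1000
    foo(m, n, k)
end
Profile.print(mincount = 10)   # hide frames with fewer than 10 samples
```

The per-line sample counts in the report show how many snapshots landed in the broadcast fills versus the `gemm!` call.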

giordano commented 3 years ago

Why should bar be slower than foo? I think baz here is a red herring.

Seelengrab commented 3 years ago

I cannot reproduce this on 1.6.0-rc1 (locally compiled)

julia> test(10,1000,100)
  180.400 μs (5 allocations: 867.47 KiB)
  117.700 μs (0 allocations: 0 bytes)
  115.300 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.6.0-rc1
Commit a58bdd9010* (2021-02-06 15:49 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PKG_SERVER =
  JULIA_NUM_THREADS = 4

julia> BLAS._get_num_threads()
4

Seems to be AMD specific.

chriselrod commented 3 years ago

> Why should bar be slower than foo? I think baz here is a red herring.

Yes, that doesn't make sense to me. I thought profiling could confirm how much time dgemm! in particular is taking in each.

The preallocated version should also benefit from memory being hot in the cache. Maybe the hardware prefetcher throws things off.

Given that @fkastner is on Linux, using LinuxPerf to look at cache misses would also be helpful.
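
To illustrate the "hot in cache / pages already faulted in" point, here is a sketch of my own (not from the thread): writing into a freshly allocated array pays for first-touch page faults, while writing into a reused buffer does not.

```julia
using BenchmarkTools

n = 1_000_000        # 8 MB of Float64, large enough that malloc typically mmaps it
reused = zeros(n)    # pages already faulted in by zeros

# Fresh allocation on every evaluation: the OS maps the pages lazily,
# so the first write to each page may trigger a minor page fault.
@btime fill!(Vector{Float64}(undef, $n), 1.0);

# Reused buffer: pages are resident, only the stores remain.
@btime fill!($reused, 1.0);
```

On Linux, running these under LinuxPerf's `@measure` should show the page-fault counts diverge between the two variants.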

fkastner commented 3 years ago

@profile for _=1:1000; foo(m, n, k); end;
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
  ╎154 @Base/client.jl:506; _start()
  ╎ 154 @Base/client.jl:296; exec_options(::Base.JLOptions)
  ╎  154 @Base/Base.jl:368; include(::Module, ::String)
  ╎   154 @Base/Base.jl:380; include(::Function, ::Module, ::String)
  ╎    154 ...linux64/build/usr/share/julia/stdlib/v1.5/Profile/src/Profile.jl:28; top-level scope
  ╎     154 /home/kastner/Dokumente/Julia/playground/testAllocation.jl:66; macro expansion
  ╎    ╎ 75  /home/kastner/Dokumente/Julia/playground/testAllocation.jl:9; foo(::Int64, ::Int64, ::Int64)
  ╎    ╎  75  @Base/broadcast.jl:845; materialize!
  ╎    ╎   75  @Base/broadcast.jl:848; materialize!
  ╎    ╎    75  @Base/broadcast.jl:894; copyto!
  ╎    ╎     75  @Base/array.jl:357; fill!
74╎    ╎    ╎ 75  @Base/array.jl:847; setindex!
  ╎    ╎ 49  /home/kastner/Dokumente/Julia/playground/testAllocation.jl:10; foo(::Int64, ::Int64, ::Int64)
  ╎    ╎  49  .../build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:208; mul!
  ╎    ╎   49  .../build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:169; mul!
  ╎    ╎    49  ...build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:597; gemm_wrapper!(::Array{Float64,2}, ::Char, ::Char, ::Array{Float64,2}, ::Array{Float64,2}, ::Li...
49╎    ╎     49  ...4/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/blas.jl:1374; gemm!(::Char, ::Char, ::Float64, ::Array{Float64,2}, ::Array{Float64,2}, ::Float64, ::Array{F...
Total snapshots: 172

and

@profile for _=1:1000; bar(A, B, C); end;
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
   ╎200 @Base/client.jl:506; _start()
   ╎ 200 @Base/client.jl:296; exec_options(::Base.JLOptions)
   ╎  200 @Base/Base.jl:368; include(::Module, ::String)
   ╎   200 @Base/Base.jl:380; include(::Function, ::Module, ::String)
   ╎    200 ...inux64/build/usr/share/julia/stdlib/v1.5/Profile/src/Profile.jl:28; top-level scope
   ╎     200 /home/kastner/Dokumente/Julia/playground/testAllocation.jl:70; macro expansion
   ╎    ╎ 30  /home/kastner/Dokumente/Julia/playground/testAllocation.jl:15; bar(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,2})
   ╎    ╎  30  @Base/broadcast.jl:845; materialize!
   ╎    ╎   30  @Base/broadcast.jl:848; materialize!
   ╎    ╎    30  @Base/broadcast.jl:894; copyto!
   ╎    ╎     30  @Base/array.jl:357; fill!
 30╎    ╎    ╎ 30  @Base/array.jl:847; setindex!
   ╎    ╎ 133 /home/kastner/Dokumente/Julia/playground/testAllocation.jl:16; bar(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,2})
   ╎    ╎  133 @Base/broadcast.jl:845; materialize!
   ╎    ╎   133 @Base/broadcast.jl:848; materialize!
   ╎    ╎    133 @Base/broadcast.jl:894; copyto!
   ╎    ╎     133 @Base/array.jl:357; fill!
132╎    ╎    ╎ 133 @Base/array.jl:847; setindex!
   ╎    ╎ 37  /home/kastner/Dokumente/Julia/playground/testAllocation.jl:17; bar(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,2})
   ╎    ╎  37  .../build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:208; mul!
   ╎    ╎   37  ...build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:169; mul!
   ╎    ╎    37  ...build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:597; gemm_wrapper!(::Array{Float64,2}, ::Char, ::Char, ::Array{Float64,2}, ::Array{Float64,2}, ::L...
 37╎    ╎     37  .../build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/blas.jl:1374; gemm!(::Char, ::Char, ::Float64, ::Array{Float64,2}, ::Array{Float64,2}, ::Float64, ::Array{...
Total snapshots: 227

fkastner commented 3 years ago

julia> @measure foo(m,n,k)
┌───────────────────────┬─────────────┬─────────────┐
│                       │ Events      │ Active Time │
├───────────────────────┼─────────────┼─────────────┤
│             hw:cycles │ 329,469     │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│       hw:cache_access │ 0           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│       hw:cache_misses │ 0           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│           hw:branches │ did not run │ 0 %         │
├───────────────────────┼─────────────┼─────────────┤
│ hw:branch_mispredicts │ did not run │ 0 %         │
├───────────────────────┼─────────────┼─────────────┤
│       hw:instructions │ did not run │ 0 %         │
├───────────────────────┼─────────────┼─────────────┤
│       sw:ctx_switches │ 0           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│        sw:page_faults │ 1           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│  sw:minor_page_faults │ 1           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│  sw:major_page_faults │ 0           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│     sw:cpu_migrations │ 0           │ 100.0 %     │
└───────────────────────┴─────────────┴─────────────┘
julia> @measure bar(A,B,C)
┌───────────────────────┬─────────────┬─────────────┐
│                       │ Events      │ Active Time │
├───────────────────────┼─────────────┼─────────────┤
│             hw:cycles │ 374,790     │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│       hw:cache_access │ 0           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│       hw:cache_misses │ 0           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│           hw:branches │ did not run │ 0 %         │
├───────────────────────┼─────────────┼─────────────┤
│ hw:branch_mispredicts │ did not run │ 0 %         │
├───────────────────────┼─────────────┼─────────────┤
│       hw:instructions │ did not run │ 0 %         │
├───────────────────────┼─────────────┼─────────────┤
│       sw:ctx_switches │ 0           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│        sw:page_faults │ 0           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│  sw:minor_page_faults │ 0           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│  sw:major_page_faults │ 0           │ 100.0 %     │
├───────────────────────┼─────────────┼─────────────┤
│     sw:cpu_migrations │ 0           │ 100.0 %     │
└───────────────────────┴─────────────┴─────────────┘

I omitted baz for now. If you want, I can profile it too.

chriselrod commented 3 years ago

Assuming the number of snapshots is proportional to time and not too noisy, we have:

Snapshots inside testAllocation: foo: 154, bar: 200

Within mul!: foo: 49, bar: 37

So bar's mul! seems faster, perhaps because the reused matrix C is fresh in cache?

Within the broadcasts: foo: 75, bar: 163

Shouldn't that be two separate calls to broadcast in foo?

For fast functions, I would do

foreachf(f::F, N, args::Vararg{Any,A}) where {F,A} = foreach(_ -> f(args...), 1:N)
@measure foreachf(foo, 10, m, n, k) # compile
@measure foreachf(bar, 10, A, B, C) # compile
@measure foreachf(foo, 10_000, m, n, k)
@measure foreachf(bar, 10_000, A, B, C)

I also prefer the @pstats macro over @measure, but that requires the master branch.

fkastner commented 3 years ago

julia> @measure foreachf(foo, 10_000, m, n, k)
┌───────────────────────┬───────────────┬─────────────┐
│                       │ Events        │ Active Time │
├───────────────────────┼───────────────┼─────────────┤
│             hw:cycles │ 4,599,882,778 │ 93.2 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:cache_access │ 0             │ 93.5 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:cache_misses │ 0             │ 93.5 %      │
├───────────────────────┼───────────────┼─────────────┤
│           hw:branches │ 25,269,366    │ 13.3 %      │
├───────────────────────┼───────────────┼─────────────┤
│ hw:branch_mispredicts │ 482,065       │ 13.3 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:instructions │ 391,034,223   │ 13.3 %      │
├───────────────────────┼───────────────┼─────────────┤
│       sw:ctx_switches │ 0             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│        sw:page_faults │ 49,892        │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│  sw:minor_page_faults │ 49,892        │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│  sw:major_page_faults │ 0             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│     sw:cpu_migrations │ 0             │ 100.0 %     │
└───────────────────────┴───────────────┴─────────────┘

and

julia> @measure foreachf(bar, 10_000, A, B, C)
┌───────────────────────┬───────────────┬─────────────┐
│                       │ Events        │ Active Time │
├───────────────────────┼───────────────┼─────────────┤
│             hw:cycles │ 7,286,116,993 │ 94.5 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:cache_access │ 0             │ 94.3 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:cache_misses │ 0             │ 94.3 %      │
├───────────────────────┼───────────────┼─────────────┤
│           hw:branches │ 13,476,961    │ 11.2 %      │
├───────────────────────┼───────────────┼─────────────┤
│ hw:branch_mispredicts │ 103,112       │ 11.2 %      │
├───────────────────────┼───────────────┼─────────────┤
│       hw:instructions │ 287,561,123   │ 11.2 %      │
├───────────────────────┼───────────────┼─────────────┤
│       sw:ctx_switches │ 0             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│        sw:page_faults │ 0             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│  sw:minor_page_faults │ 0             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│  sw:major_page_faults │ 0             │ 100.0 %     │
├───────────────────────┼───────────────┼─────────────┤
│     sw:cpu_migrations │ 0             │ 100.0 %     │
└───────────────────────┴───────────────┴─────────────┘

fkastner commented 3 years ago

Ok, I suppose page_faults and branch_mispredicts are bad and that's why bar is faster than... wait. 🤔

chriselrod commented 3 years ago

What happens if you remove the mul!, and just benchmark the broadcasting?
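
Broadcast-only variants could look like this (a sketch; the `foo_bc`/`bar_bc` names are mine, not from the thread):

```julia
using BenchmarkTools

# foo without the mul!: allocate fresh matrices, then fill them.
function foo_bc(m, n, k)
    A = Matrix{Float64}(undef, m, n)
    B = Matrix{Float64}(undef, n, k)
    A .= 1.0
    B .= 1.0
    return A, B
end

# bar without the mul!: fill preallocated matrices.
function bar_bc(A, B)
    A .= 1.0
    B .= 1.0
    return A, B
end

m, n, k = 10, 1000, 100
A = zeros(m, n)
B = zeros(n, k)
@btime foo_bc($m, $n, $k);
@btime bar_bc($A, $B);
```

Comparing these two isolates the fill cost from whatever BLAS is doing in `dgemm!`.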

fkastner commented 3 years ago

Fredrik Ekre suggested that before. Without the mul!, the timings look more reasonable:

julia> test(10,1000,100)
  54.893 μs (5 allocations: 867.47 KiB)
  14.907 μs (0 allocations: 0 bytes)
  20.098 μs (0 allocations: 0 bytes)
fkastner commented 2 years ago

Just an update: I came back to this and wondered whether it had gotten better on newer versions. It turns out it got worse. (Using test(10, 1000, 100) from the first post.)

| julia           | foo                                    | bar                                 |
| --------------- | -------------------------------------- | ----------------------------------- |
| v1.5.4          | 112.360 μs (5 allocations: 867.47 KiB) | 197.449 μs (0 allocations: 0 bytes) |
| v1.6.5          | 113.291 μs (5 allocations: 867.47 KiB) | 196.426 μs (0 allocations: 0 bytes) |
| v1.7.2          | 194.573 μs (5 allocations: 867.41 KiB) | 223.136 μs (0 allocations: 0 bytes) |
| v1.8.0-DEV.1505 | 278.971 μs (5 allocations: 867.41 KiB) | 353.469 μs (0 allocations: 0 bytes) |

Should I file this as a separate regression?

Test system is still the same:

julia> versioninfo()
Julia Version 1.8.0-DEV.1505
Commit e422590151 (2022-02-12 17:07 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, znver1)