FluxML / Flux.jl

Relax! Flux is the ML library that doesn't make you tensor
https://fluxml.ai/

performance Dense layer on CPU #1414

Open CarloLucibello opened 3 years ago

CarloLucibello commented 3 years ago

This is just to track the performance of the Dense layer on CPU. I use the following script:

using BenchmarkTools, Flux
using Zygote: pullback

using LinearAlgebra
BLAS.set_num_threads(1)

function perf_test(n)
    r = rand(Float32, n, n)
    d = Dense(n, n, relu)
    println("  FORW")
    @btime sum($d($r))
    println("  GRADIENT")
    @btime gradient(() -> sum($d($r)), $(Flux.params(d)))
    @btime gradient((d) -> sum(d($r)), $d)
    println("  PULLBACK")
    y, back = pullback((d) -> sum(d(r)), d)
    @btime pullback((d) -> sum(d($r)), $d)
    @btime $back(1f0)
end

println("SMALL NET n=2")
perf_test(2)
println("MEDIUM NET n=20")
perf_test(20)
println("LARGE NET n=200")
perf_test(200)
println("VERY LARGE NET n=2000")
perf_test(2000)

and on my system:

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1* (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.1 (ORCJIT, skylake)

(Flux) pkg> st
Project Flux v0.12.0-dev
Status `~/.julia/dev/Flux/Project.toml`
  [1520ce14] AbstractTrees v0.3.3
  [79e6a3ab] Adapt v2.3.0
  [052768ef] CUDA v2.3.0
  [944b1d66] CodecZlib v0.7.0
  [5ae59095] Colors v0.12.4
  [d9f16b24] Functors v0.1.0
  [e5e0dc1b] Juno v0.8.4
  [1914dd2f] MacroTools v0.5.6
  [872c559c] NNlib v0.7.7
  [189a3867] Reexport v0.2.0
  [2913bbd2] StatsBase v0.33.2
  [a5390f91] ZipFile v0.9.3
  [e88e6eb3] Zygote v0.5.15
  [8bb1440f] DelimitedFiles
  [37e2e46d] LinearAlgebra
  [44cfe95a] Pkg
  [de0858da] Printf
  [9a3f8284] Random
  [ea8e919c] SHA
  [10745b16] Statistics
  [8dfed614] Test

I obtain the following output:

SMALL NET n=2
  FORW
  99.930 ns (2 allocations: 192 bytes)
  GRADIENT
  2.096 μs (40 allocations: 2.92 KiB)
  1.045 μs (31 allocations: 1.77 KiB)
  PULLBACK
  167.077 ns (5 allocations: 512 bytes)
  814.164 ns (24 allocations: 928 bytes)
MEDIUM NET n=20
  FORW
  1.049 μs (2 allocations: 3.53 KiB)
  GRADIENT
  5.334 μs (38 allocations: 12.95 KiB)
  4.222 μs (31 allocations: 11.86 KiB)
  PULLBACK
  1.383 μs (5 allocations: 5.52 KiB)
  2.747 μs (24 allocations: 5.98 KiB)
LARGE NET n=200
  FORW
  205.632 μs (4 allocations: 312.66 KiB)
  GRADIENT
  643.443 μs (44 allocations: 941.05 KiB)
  626.491 μs (37 allocations: 939.95 KiB)
  PULLBACK
  219.883 μs (8 allocations: 469.20 KiB)
  405.049 μs (27 allocations: 470.39 KiB)
VERY LARGE NET n=2000
  FORW
  214.841 ms (4 allocations: 30.52 MiB)
  GRADIENT
  637.410 ms (44 allocations: 91.56 MiB)
  637.142 ms (37 allocations: 91.56 MiB)
  PULLBACK
  217.240 ms (8 allocations: 45.78 MiB)
  418.468 ms (27 allocations: 45.78 MiB)

Some observations:

CarloLucibello commented 3 years ago

Related issues I could find are #1307 and #1273.

mcabbott commented 3 years ago

Maybe this should use Dense(n, n, relu): gradient(sum, rand(Float32, 3))[1] isa Zygote.Fill, which I think gets you a generic * (but this wouldn't happen in real use), whereas gradient(x -> sum(relu, x), rand(Float32, 3))[1] isa Array. This shaves off an order of magnitude at n=200.
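For context, a minimal sketch of the gradient-type difference described above (assuming Flux, Zygote, and FillArrays are loaded; the variable names are illustrative):

```julia
using Flux, Zygote, FillArrays

x = rand(Float32, 3)

# The gradient of sum is 1 everywhere, so Zygote returns a lazy Fill;
# a matmul against a Fill falls back to a generic method instead of BLAS.
g1 = gradient(sum, x)[1]
g1 isa Fill    # true

# With a nonlinearity in the way, the gradient is an ordinary dense Array,
# which is what happens in real training.
g2 = gradient(x -> sum(relu, x), x)[1]
g2 isa Array   # true
```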

CarloLucibello commented 3 years ago

Good catch! I'm updating the script with relu (and the output as well).

As you said, that FillArrays performance problem is not relevant in our real scenarios. Nonetheless, I wrote a fix here: https://github.com/JuliaArrays/FillArrays.jl/pull/129