EnzymeAD / Enzyme.jl

Julia bindings for the Enzyme automatic differentiator
https://enzyme.mit.edu
MIT License

Benchmarking some very simple Flux models #2069

Open mcabbott opened 6 days ago

mcabbott commented 6 days ago

On some extremely simple Flux models, Enzyme seems to be slower than Zygote for me. What's going wrong here?

julia> using Flux, Enzyme, Test, BenchmarkTools

julia> mlp = Chain(Flux.flatten, Dense(28^2 => 32, tanh), Dense(32 => 10));

julia> img = rand32(28, 28, 1, 128);

julia> @inferred mlp(img);  # type-stable

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 -15.980308
   6.2900686
 -79.44746

julia> Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 -15.980312
   6.2900686
 -79.44745
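
The differences in the last digit are ordinary Float32 rounding; a tolerance-based comparison (a quick sketch using the Test stdlib loaded above; `g_flux` and `g_enz` are names introduced here) confirms the two backends agree:

julia> g_flux = Flux.gradient((m,x) -> sum(abs2, m(x)), mlp, img)[1];

julia> g_enz = Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), mlp, img)[1];

julia> @test g_flux.layers[2].bias ≈ g_enz.layers[2].bias;  # isapprox, default Float32 tolerance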

julia> @btime $mlp($img);
  min 10.958 μs, mean 14.119 μs (6 allocations, 43.09 KiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img);
  min 38.250 μs, mean 67.356 μs (86 allocations, 596.27 KiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img);
  min 75.125 μs, mean 119.919 μs (55 allocations, 579.61 KiB)

# a slightly bigger model

julia> lenet = Chain(  # from the model zoo
           Conv((5, 5), 1=>6, relu),
           MaxPool((2, 2)),
           Conv((5, 5), 6=>16, relu),
           MaxPool((2, 2)),
           Flux.flatten,
           Dense(256 => 120, relu),
           Dense(120 => 84, relu), 
           Dense(84 => 10),
       );

julia> @inferred lenet(img);  # type-stable

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
 10.119315
  0.0
...

julia> Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
 10.119322
  0.0
...

julia> @btime $lenet($img);
  min 655.583 μs, mean 1.107 ms (160 allocations, 5.60 MiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 4.979 ms, mean 6.300 ms (558 allocations, 14.18 MiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 8.347 ms, mean 9.752 ms (538 allocations, 15.42 MiB)

# tweak Enzyme to see if details matter...

julia> tmp_loss(m,x) = sum(abs2, m(x));  # give it a name

julia> @btime Enzyme.gradient(Reverse, tmp_loss, $lenet, $img);
  min 8.260 ms, mean 9.766 ms (538 allocations, 15.42 MiB)

julia> @btime Enzyme.gradient(Reverse, tmp_loss, $lenet, Const($img));
  min 8.030 ms, mean 9.235 ms (479 allocations, 14.75 MiB)

julia> @btime Enzyme.autodiff(Reverse, tmp_loss, Active, $(Duplicated(lenet, deepcopy(lenet))), Const($img));
  min 7.642 ms, mean 8.638 ms (359 allocations, 14.57 MiB)
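
If I understand the Enzyme API correctly: `Const(img)` marks the input as not needing a gradient, `Active` declares the scalar return active, and `Duplicated(lenet, shadow)` supplies a preallocated shadow structure that the reverse pass accumulates gradients into, which is why the last call allocates the least. A sketch of reading the gradient back out of the shadow (`shadow` is a name introduced here):

julia> shadow = Enzyme.make_zero(lenet);  # zeroed copy of the model's structure

julia> Enzyme.autodiff(Reverse, tmp_loss, Active, Duplicated(lenet, shadow), Const(img));

julia> shadow.layers[1].bias[1:3];  # gradient accumulated in place; re-zero before reusing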

Versions:

(jl_w98UzC) pkg> st Enzyme
Status `/private/var/folders/yq/4p2zwd614y59gszh7y9ypyhh0000gn/T/jl_w98UzC/Project.toml`
  [7da242da] Enzyme v0.13.14

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 11 × Apple M3 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 4 default, 0 interactive, 2 GC (on 5 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4
mcabbott commented 6 days ago

Trying this on another computer, with Julia 1.11, I see a similar slowdown on the small model, and a failure on the larger one.

```julia
julia> @btime $mlp($img);
  173.251 μs (13 allocations: 42.36 KiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img);
  494.602 μs (69 allocations: 588.97 KiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img);
  884.058 μs (91 allocations: 586.92 KiB)

# Larger model fails:

julia> Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
ERROR: No create nofree of empty function (julia.gc_loaded)
julia.gc_loaded)
 at context: call fastcc void @julia__PoolDims_14_107488({ [2 x i64], [2 x i64], i64, [2 x i64], [4 x i64], [2 x i64] }* noalias nocapture nofree noundef nonnull writeonly sret({ [2 x i64], [2 x i64], i64, [2 x i64], [4 x i64], [2 x i64] }) align 8 dereferenceable(104) %5, [2 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(64) %35, [4 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(96) %34, [4 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(32) %44, [2 x i64] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(112) %36) #268, !dbg !297 (julia__PoolDims_14_107488)

Stacktrace:
 [1] PoolDims
   @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:20
 [2] PoolDims
   @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:43
 [3] MaxPool
   @ ~/.julia/packages/Flux/htpCe/src/layers/conv.jl:728
 [4] macro expansion
   @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53
 [5] _applychain
   @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53

Stacktrace:
  [1] PoolDims
    @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:20 [inlined]
  [2] PoolDims
    @ ~/.julia/packages/NNlib/CkJqS/src/dim_helpers/PoolDims.jl:43 [inlined]
  [3] MaxPool
    @ ~/.julia/packages/Flux/htpCe/src/layers/conv.jl:728 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53 [inlined]
  [5] _applychain
    @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:53
  [6] Chain
    @ ~/.julia/packages/Flux/htpCe/src/layers/basic.jl:51 [inlined]
  [7] #19
    @ ./REPL[31]:1 [inlined]
  [8] diffejulia__19_105996_inner_242wrap
    @ ./REPL[31]:0
  [9] macro expansion
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:8305 [inlined]
 [10] enzyme_call
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7868 [inlined]
 [11] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/RvNgp/src/compiler.jl:7641 [inlined]
 [12] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:491 [inlined]
 [13] autodiff
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:512 [inlined]
 [14] macro expansion
    @ ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:1678 [inlined]
 [15] gradient(rm::ReverseMode{…}, f::var"#19#20", x::Chain{…}, args::Array{…})
    @ Enzyme ~/.julia/packages/Enzyme/RvNgp/src/Enzyme.jl:1661
 [16] top-level scope
    @ REPL[31]:1
Some type information was truncated. Use `show(err)` to see complete types.

(jl_KEzUxT) pkg> st Enzyme
Status `/tmp/jl_KEzUxT/Project.toml`
  [7da242da] Enzyme v0.13.14

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, broadwell)
Threads: 4 default, 0 interactive, 2 GC (on 12 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4
```
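
Since the stacktrace bottoms out in `PoolDims` inside `MaxPool`, one reduction worth trying (a hypothetical minimal example, not yet run) is whether a lone pooling layer already triggers the same `julia.gc_loaded` error:

```julia
julia> mp = MaxPool((2, 2));

julia> xs = randn(Float32, 24, 24, 6, 1);

julia> Enzyme.gradient(Reverse, x -> sum(abs2, mp(x)), xs);  # does this alone reproduce the error?
```
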
wsmoses commented 2 days ago

Sorry, finally getting around to this.

So for the first case, I don't see that much of a gap (though it would definitely be good to improve):


julia> fn(m, x) = sum(abs2, m(x))
fn (generic function with 2 methods)

julia> @btime $fn($mlp, $img);
  250.295 μs (6 allocations: 42.59 KiB)

julia> @btime Flux.gradient($fn, $mlp, $img);
  713.314 μs (84 allocations: 595.17 KiB)

julia> dmlp = Enzyme.make_zero(mlp);

julia> dimg = Enzyme.make_zero(img);

julia> @btime Enzyme.autodiff(Reverse, $fn, $(Duplicated(mlp, dmlp)), $(Duplicated(img, dimg)));
  800.866 μs (11 allocations: 85.16 KiB)
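
One caveat with this pattern (assuming `Enzyme.make_zero!` is available, as in Enzyme 0.13): reverse mode accumulates `+=` into the shadows, so repeated calls inside `@btime` keep adding to `dmlp` and `dimg`. The timing is unaffected, but to read off a correct gradient the shadows need re-zeroing between calls:

julia> Enzyme.make_zero!(dmlp);  # reset the shadow in place before the next autodiff call

julia> Enzyme.make_zero!(dimg);
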
mcabbott commented 2 days ago

I'm surprised how different those numbers are. I realised I have AppleAccelerate loaded, and if I run with `--startup-file=no` to use OpenBLAS instead, the relative difference is much smaller. (In fact the absolute difference is almost cut in half too.)
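
For the record, the active BLAS backend can be checked with the standard LinearAlgebra API (Accelerate appears in this list when AppleAccelerate.jl is loaded):

julia> using LinearAlgebra

julia> BLAS.get_config()  # lists the BLAS/LAPACK backends loaded via libblastrampoline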

julia> @btime $mlp($img);
  min 104.833 μs, mean 109.179 μs (6 allocations, 43.09 KiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img);  # Zygote, allocating
  min 243.792 μs, mean 305.012 μs (84 allocations, 596.17 KiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img);  # allocating
  min 266.292 μs, mean 329.010 μs (55 allocations, 579.61 KiB)

julia> @btime Enzyme.autodiff(Reverse, $((m,x) -> sum(abs2, m(x))), $(Duplicated(mlp, Enzyme.make_zero(mlp))), $(Duplicated(img, Enzyme.make_zero(img))));  # pre-allocated
  min 256.916 μs, mean 270.453 μs (11 allocations, 86.16 KiB)

(Same machine & versions as above.)

wsmoses commented 2 days ago

Huh, so what exactly causes it to be slow? AppleAccelerate itself?

mcabbott commented 2 days ago

Don't know. For the other model, changing to OpenBLAS gives a slightly larger time-difference instead (and a slightly smaller ratio).

julia> @btime $lenet($img);  # was min 655.583 μs, mean 1.107 ms with AppleAccelerate above
  min 839.916 μs, mean 1.910 ms (160 allocations, 5.60 MiB)

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 7.980 ms, mean 9.273 ms (556 allocations, 14.18 MiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 11.960 ms, mean 13.037 ms (538 allocations, 15.42 MiB)

julia> @btime Enzyme.autodiff(Reverse, $((m,x) -> sum(abs2, m(x))), $(Duplicated(lenet, Enzyme.make_zero(lenet))), $(Duplicated(img, Enzyme.make_zero(img))));
  min 12.017 ms, mean 13.615 ms (415 allocations, 14.85 MiB)

The times at https://github.com/EnzymeAD/Enzyme.jl/issues/2069#issuecomment-2460867943, from a different computer, also don't involve AppleAccelerate.