mcabbott opened this issue 6 days ago
Trying this on another computer, with Julia 1.11, I see a similar slowdown on the small model, and a failure on the larger one.
Sorry, finally getting around to this.
So for the first case, I don't see that much of a gap (though it would definitely be good to improve):
julia> fn(m, x) = sum(abs2, m(x))
fn (generic function with 2 methods)
julia> @btime $fn($mlp, $img);
250.295 μs (6 allocations: 42.59 KiB)
julia> @btime Flux.gradient($fn, $mlp, $img);
713.314 μs (84 allocations: 595.17 KiB)
julia> dmlp = Enzyme.make_zero(mlp);
julia> dimg = Enzyme.make_zero(img);
julia> @btime Enzyme.autodiff(Reverse, $fn, $(Duplicated(mlp, dmlp)), $(Duplicated(img, dimg)));
800.866 μs (11 allocations: 85.16 KiB)
Surprised how different those numbers are. I realised I have AppleAccelerate loaded, and if I run with --startup-file=no to use OpenBLAS instead, the relative difference is much smaller. (In fact the absolute difference is almost cut in half too.)
julia> @btime $mlp($img);
min 104.833 μs, mean 109.179 μs (6 allocations, 43.09 KiB)
julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $mlp, $img); # Zygote, allocating
min 243.792 μs, mean 305.012 μs (84 allocations, 596.17 KiB)
julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $mlp, $img); # allocating
min 266.292 μs, mean 329.010 μs (55 allocations, 579.61 KiB)
julia> @btime Enzyme.autodiff(Reverse, $((m,x) -> sum(abs2, m(x))), $(Duplicated(mlp, Enzyme.make_zero(mlp))), $(Duplicated(img, Enzyme.make_zero(img)))); # pre-allocated
min 256.916 μs, mean 270.453 μs (11 allocations, 86.16 KiB)
(Same machine & versions as above.)
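In case it helps anyone reproduce the AppleAccelerate vs OpenBLAS difference, the backend actually in use can be checked with the standard LinearAlgebra API (this check is mine, not part of the benchmarks above):

using LinearAlgebra
BLAS.get_config()   # lists the loaded BLAS/LAPACK libraries, e.g. libopenblas64_ vs Accelerate

Loading AppleAccelerate.jl forwards BLAS calls to Accelerate via libblastrampoline, so the same benchmark code picks up whichever backend the startup file loaded.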
Huh, so what exactly causes it to be slow? AppleAccelerate itself?
Don't know. For the other model, changing to OpenBLAS gives a slightly larger time difference instead (and a slightly smaller ratio).
julia> @btime $lenet($img); # was min 655.583 μs, mean 1.107 ms with AppleAccelerate above
min 839.916 μs, mean 1.910 ms (160 allocations, 5.60 MiB)
julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $lenet, $img);
min 7.980 ms, mean 9.273 ms (556 allocations, 14.18 MiB)
julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $lenet, $img);
min 11.960 ms, mean 13.037 ms (538 allocations, 15.42 MiB)
julia> @btime Enzyme.autodiff(Reverse, $((m,x) -> sum(abs2, m(x))), $(Duplicated(lenet, Enzyme.make_zero(lenet))), $(Duplicated(img, Enzyme.make_zero(img))));
min 12.017 ms, mean 13.615 ms (415 allocations, 14.85 MiB)
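One way to narrow down where the Zygote/Enzyme gap comes from (not something done above, just a sketch, assuming lenet is a Chain whose first layer is a Conv) would be to benchmark a single layer's gradient in isolation:

conv1 = lenet[1]   # hypothetical: first Conv layer of the Chain
@btime Flux.gradient((c, x) -> sum(abs2, c(x)), $conv1, $img);             # Zygote
@btime Enzyme.gradient(Reverse, (c, x) -> sum(abs2, c(x)), $conv1, $img);  # Enzyme

If the conv layer alone shows the same ratio, the overhead is in the conv pullbacks rather than in how Enzyme handles the Chain structure.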
The times here (https://github.com/EnzymeAD/Enzyme.jl/issues/2069#issuecomment-2460867943), from a different computer, also don't involve AppleAccelerate.
On some extremely simple Flux models, Enzyme seems to be slower than Zygote for me. What's going wrong here?
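For concreteness, a minimal setup along these lines (the exact mlp, lenet and img aren't shown in this excerpt, so these definitions are assumed stand-ins modelled on the Flux quickstart; the timing calls match those above):

using Flux, Enzyme

mlp = Chain(Flux.flatten, Dense(28^2 => 32, relu), Dense(32 => 10))   # assumed small model
lenet = Chain(Conv((5, 5), 1 => 6, relu), MaxPool((2, 2)),
              Conv((5, 5), 6 => 16, relu), MaxPool((2, 2)),
              Flux.flatten, Dense(256 => 120, relu),
              Dense(120 => 84, relu), Dense(84 => 10))                 # assumed larger model
img = rand(Float32, 28, 28, 1, 128);                                   # assumed input batch

fn(m, x) = sum(abs2, m(x))
Flux.gradient(fn, mlp, img);              # Zygote
Enzyme.gradient(Reverse, fn, mlp, img);   # Enzyme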
Versions: