*Closed: VarLad closed this issue 4 months ago.*
Thanks for opening this @VarLad -- there's an awfully high compile time here. @gdalle I suspect this isn't quite how the DI interface is meant to be used -- could you comment?
Part of it is an issue with the way you use DifferentiationInterface. For a given pair `(f, backend)`, you should always prepare the differential operator in advance so that subsequent computations are faster. See the following parts of the documentation:
For Zygote, this makes absolutely no difference at the moment, since the preparation step does nothing. But for Tapir it is crucial, since the preparation step is where the differentiation rule for your function is created with `Tapir.build_rrule` (`prepare_gradient` calls `prepare_pullback` in this case):
Thus the correct way to benchmark looks something like:

```julia
using Chairmarks, DifferentiationInterface
import Tapir, Zygote

f(x) = sqrt(sum(abs2, x) * sum(abs2, x))

@b (
    x=rand(10000),
    extras=prepare_gradient(f, AutoTapir(), rand(10000)),
) value_and_gradient(f, AutoTapir(), _.x, _.extras) evals=1

@b (
    x=rand(10000),
    extras=prepare_gradient(f, AutoZygote(), rand(10000)),
) value_and_gradient(f, AutoZygote(), _.x, _.extras) evals=1
```
I defined the function, the `x`, and the `extras` outside of the benchmark because that is not what we want to measure: ideally you will compute many different gradients with the same `f` and `extras` (but possibly different `x`s).
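As an aside (my addition, not part of the original discussion): this particular `f` simplifies algebraically, since `sqrt(a * a) == a` for the nonnegative `a = sum(abs2, x)`. So `f(x) == sum(abs2, x)` and its gradient is exactly `2x`, which gives a cheap pure-Julia correctness check for whichever backend you benchmark:

```julia
# Pure-Julia sanity check (no AD packages needed).
f(x) = sqrt(sum(abs2, x) * sum(abs2, x))

x = rand(100)
# f collapses to the squared norm:
@assert f(x) ≈ sum(abs2, x)

# ...so its gradient is 2x; a central finite difference on one coordinate agrees:
h = 1e-6
e = zeros(length(x)); e[1] = h
fd = (f(x .+ e) - f(x .- e)) / (2h)
@assert isapprox(fd, 2x[1]; atol=1e-4)
```

Comparing `value_and_gradient` output against `(sum(abs2, x), 2x)` is an easy way to confirm a backend is returning the right answer, not just running fast.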
Unfortunately, on my laptop I still see huge compilation time when I do that, so I would guess it's up to Tapir then
I think a fair comparison should include the preparation time as well.
Same debate as whether or not we include compilation time when benchmarking Julia against other languages 🤷
With DifferentiationInterfaceTest's benchmarking utilities, everything is measured, so you can decide by yourself. Here I'm assuming that we want to compute many gradients. If it's a debatable assumption, then we should indeed adapt the benchmark.
> Unfortunately, on my laptop I still see huge compilation time when I do that, so I would guess it's up to Tapir then
I think it has something to do with the `sum` operator. Take a look at this:
```julia
julia> function mysum(x)
           s = zero(eltype(x))
           for i in eachindex(x)
               s += x[i]
           end
           return s
       end
mysum (generic function with 1 method)

julia> @b (
           x=rand(10000),
           extras=prepare_gradient(sum, AutoTapir(), rand(10000))
       ) value_and_gradient(sum, AutoTapir(), _.x, _.extras)
597.588 ms (1188425 allocs: 75.311 MiB, 2.54% gc time, 68.11% compile time)

julia> @b (
           x=rand(10000),
           extras=prepare_gradient(mysum, AutoTapir(), rand(10000))
       ) value_and_gradient(mysum, AutoTapir(), _.x, _.extras)
1.031 ms (19 allocs: 1.110 MiB)
```
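For reference (my addition, not part of the original report): `mysum` is just a hand-rolled accumulation loop, so it should agree with `Base.sum` on any vector, which confirms the two benchmarks above differentiate the same mathematical function:

```julia
# Hand-rolled sum, behaviorally equivalent to Base.sum on a Float64 vector.
function mysum(x)
    s = zero(eltype(x))
    for i in eachindex(x)
        s += x[i]
    end
    return s
end

x = rand(1000)
@assert mysum(x) ≈ sum(x)  # same value, different implementation
```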
I can reproduce the same. There's something wrong with the `sum` operator and Tapir.
And this shows in both the code with preparation and without.
Although, even with the provided example, Enzyme's results show that there's still a lot of room for optimization :eyes:
> I think a fair comparison should include the preparation time as well.
@CarloLucibello I don't think we need to frame this in terms of fairness, just in terms of what one is interested in. I personally am only really interested (at present) in optimising Tapir for runtime performance, subject to the constraint that prep + compile time shouldn't be unacceptably large (for some vague qualitative notion of acceptable e.g. it shouldn't take an hour for your code to compile). Consequently (at present) I only really care about benchmarks that exclude prep + compile time, and of reports of "silly" prep + compile times. I of course understand that you may be interested in different use cases than I, and therefore be more interested in prep + compile time, and that's completely fine.
> Unfortunately, on my laptop I still see huge compilation time when I do that, so I would guess it's up to Tapir then

> I can reproduce the same. There's something wrong with the `sum` operator and Tapir.
@gdalle @VarLad I don't believe these assessments are correct -- it appears to be to do with how Chairmarks treats `extras`:
```julia
]add Chairmarks DifferentiationInterface Tapir Zygote
using Chairmarks, DifferentiationInterface
import Tapir, Zygote

f(x) = sqrt(sum(abs2, x) * sum(abs2, x))

@b (
    x=rand(10000),
    extras=prepare_gradient(f, AutoTapir(), rand(10000)),
) value_and_gradient(f, AutoTapir(), _.x, _.extras) evals=1
# This one still seems to include setup time -- seems like the time to compute extras is included
# 511.842 ms (2207913 allocs: 140.952 MiB, 2.73% gc time, 67.73% compile time)

extras = prepare_gradient(f, AutoTapir(), rand(10000))
@b (
    x=rand(10000),
    extras=extras,
) value_and_gradient(f, AutoTapir(), _.x, _.extras) evals=1
# Extras computation is excluded
# 205.653 μs (2 allocs: 78.172 KiB)
```
Tapir is still the worst of the bunch, but we don't have a rule for `sum` -- we just differentiate through the optimised IR. Enzyme definitely generates better code in this instance than Tapir currently does, and Zygote is better than either Enzyme or Tapir here because it has a rule specifically for `sum`.

Similarly, for `mysum`:
```julia
function mysum(x)
    s = zero(eltype(x))
    for i in eachindex(x)
        s += x[i]
    end
    return s
end

extras = prepare_gradient(mysum, AutoTapir(), rand(10000))
@b (
    x=rand(10000),
    extras=extras,
) value_and_gradient(mysum, AutoTapir(), _.x, _.extras)
# 74.769 μs (2 allocs: 78.172 KiB)

extras = prepare_gradient(mysum, AutoZygote(), rand(10000))
@b (
    x=rand(10000),
    extras=extras,
) value_and_gradient(mysum, AutoZygote(), _.x, _.extras)
# 391.253 ms (350098 allocs: 778.816 MiB, 45.83% gc time)
```
I would imagine that Enzyme will do really very well on this example.
I'm very puzzled by this behavior of Chairmarks. Normally the setup code shouldn't be included in the measurements. Maybe @LilithHafner can help shed some light?
Ohh I just had a look at what Chairmarks is doing via `@macroexpand`, and I now understand what's happening.
There's a difference between the performance of Tapir the first time you run it and the second time you run it. The first time you run it, a bunch of allocations happen (various stacks get allocated, etc.; there's also some actual compilation which gets interleaved with running code). Provided you have the same control flow on the second run, said allocations won't happen again (if control flow does change, additional allocations might have to happen).
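A minimal pure-Julia sketch of the same effect (my illustration, not Tapir-specific): the first call to a freshly defined function pays a one-off compilation cost that later calls don't, while the computed value is identical either way.

```julia
# Illustration of first-call overhead (assumption: generic Julia JIT behavior,
# not specific to Tapir). The first call triggers compilation; later calls reuse it.
g(x) = sum(abs2, x)

x = rand(10_000)
t_first  = @elapsed g(x)   # includes compiling g for Vector{Float64}
t_second = @elapsed g(x)   # compiled code is reused

# Only the timing differs between runs, never the result.
@assert g(x) == g(x)
println("first call: ", t_first, " s, second call: ", t_second, " s")
```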
The following ought to work:

```julia
]add Chairmarks DifferentiationInterface Tapir Zygote
using Chairmarks, DifferentiationInterface
import Tapir, Zygote

f(x) = sqrt(sum(abs2, x) * sum(abs2, x))

function prep_grad(f, ::AutoTapir, x)
    extras = prepare_gradient(f, AutoTapir(), rand(10000))
    value_and_gradient(f, AutoTapir(), x, extras)  # warm-up run
    return extras
end

@b (
    x=rand(10000),
    extras=prep_grad(f, AutoTapir(), rand(10000)),
) value_and_gradient(f, AutoTapir(), _.x, _.extras) evals=1
# 220.602 μs (2 allocs: 78.172 KiB)
```
So it is a Tapir.jl thing, but not quite what we thought. The runtime performance is good, just not great the first time you run it.
edit: this issue does raise the question of how best to communicate the correct way to benchmark Tapir performance. @gdalle I feel like we've had a few of these kinds of problems now, where Tapir is doing the right thing in a performant way, it's just that we've been using it incorrectly haha
> So it is a Tapir.jl thing, but not quite what we thought. The runtime performance is good, just not great the first time you run it.
Presumably the overhead on the first run _even after `build_rrule`_ is still big enough that Chairmarks doesn't bother running the code several times, and just returns that one run.
> edit: this issue does raise the question of how best to communicate the correct way to benchmark Tapir performance.
Well, in the spirit of "prepare once, run a thousand times" (which is the one adopted by DI), I guess we should add one call to the operator inside the preparation workflow?
> Well, in the spirit of "prepare once, run a thousand times" (which is the one adopted by DI), I guess we should add one call to the operator inside the preparation workflow?
I think this is probably the right thing to do -- if it causes problems for some reason, then we can always revert it at a later point in time.
Can you help out with that one? https://github.com/gdalle/DifferentiationInterface.jl/pull/254
I believe that if you follow @gdalle's original approach to benchmarking here you should find that it works nicely now. If you could confirm that it indeed performs as expected for you, I'll close this issue.
And if you use the main branch of DifferentiationInterface. Release coming tomorrow or so
@willtebbutt it might be worth mentioning the preparation step in the readme when you discuss the DI bindings?
> Presumably the overhead on the first run even after `build_rrule` is still big enough that Chairmarks doesn't bother running the code several times, and just returns that one run.
This is correct. Chairmarks and BenchmarkTools both make the (in this case false) assumption that sample runtime is equal to evaluation runtime times the number of evaluations, and report the extrapolated per-evaluation runtime. I'll look into providing better results in this case.
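To make that assumption concrete (my illustration, with made-up numbers): if a sample contains one slow cold evaluation followed by fast warm ones, dividing the total sample time by the evaluation count blends the two and badly over-reports the warm per-evaluation time.

```julia
# Hypothetical timings (not from the benchmarks above): one cold evaluation
# plus nine warm ones, timed together as a single sample.
t_cold  = 500e-3   # 500 ms first evaluation (compilation + allocation)
t_warm  = 200e-6   # 200 μs per warm evaluation
n_evals = 10

sample_time = t_cold + (n_evals - 1) * t_warm
reported = sample_time / n_evals   # what the extrapolation reports per evaluation

# The single cold run inflates the per-evaluation estimate by roughly 250x.
@assert reported > 100 * t_warm
```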
```julia
]add Chairmarks DifferentiationInterface Tapir
using Chairmarks, DifferentiationInterface
import Tapir

f(x) = sqrt(sum(abs2, x) * sum(abs2, x))

@b (
    x=rand(10000),
    extras=prepare_gradient(sum, AutoTapir(safe_mode=false), rand(10000))
) value_and_gradient(sum, AutoTapir(), _.x, _.extras)
```

I just ran the above and got

```
94.448 μs (2 allocs: 78.172 KiB)
```
which is what I would expect. I'm using versions:

```
[0ca39b1e] Chairmarks v1.2.1
[a0c0ee7d] DifferentiationInterface v0.4.0
[07d77754] Tapir v0.2.12
```
Consequently I'm going to mark this as resolved. Thanks all for contributing to this -- it's been very useful to iron this problem out!
Thanks for this! While using this with DifferentiationInterface, I get worse performance numbers. Thought reporting it here would make sense.
Do tell if it's a DifferentiationInterface.jl issue. This is on Linux on Julia 1.10.