compintell / Mooncake.jl

https://compintell.github.io/Mooncake.jl/
MIT License

Performance issues #140

Closed VarLad closed 4 months ago

VarLad commented 4 months ago

Thanks for this! While using this with DifferentiationInterface, I get worse performance numbers. Thought reporting it here would make sense.

[screenshot: benchmark results]

Do tell if it's a DifferentiationInterface.jl issue. This is on Linux, on Julia 1.10.

willtebbutt commented 4 months ago

Thanks for opening this @VarLad -- there's an awfully high compile time here. @gdalle I suspect this isn't quite how the DI interface is meant to be used -- could you comment?

gdalle commented 4 months ago

Part of it is an issue with the way you use DifferentiationInterface. For a given pair (f, backend), you should always prepare the differential operator in advance so that subsequent computations are faster; see the relevant parts of the documentation on preparation.

For Zygote, this makes absolutely no difference at the moment, since the preparation step does nothing:

https://github.com/gdalle/DifferentiationInterface.jl/blob/16d93ef4111e1c196912a3dd53ffb31e04445324/DifferentiationInterface/ext/DifferentiationInterfaceZygoteExt/DifferentiationInterfaceZygoteExt.jl#L38

But for Tapir it is crucial, since the preparation step is where the differentiation rule for your function is created with Tapir.build_rrule (prepare_gradient calls prepare_pullback in this case):

https://github.com/gdalle/DifferentiationInterface.jl/blob/16d93ef4111e1c196912a3dd53ffb31e04445324/DifferentiationInterface/ext/DifferentiationInterfaceTapirExt/onearg.jl#L6-L9

Thus the correct way to benchmark looks something like

using Chairmarks, DifferentiationInterface
import Tapir, Zygote

f(x) = sqrt(sum(abs2, x) * sum(abs2, x))

@b (
    x=rand(10000),
    extras=prepare_gradient(f, AutoTapir(), rand(10000))
) value_and_gradient(f, AutoTapir(), _.x, _.extras) evals=1

@b (
    x=rand(10000),
    extras=prepare_gradient(f, AutoZygote(), rand(10000))
) value_and_gradient(f, AutoZygote(), _.x, _.extras) evals=1

I defined the function, the x and the extras outside of the benchmarked expression because their construction is not what we want to measure: ideally you will compute many different gradients with the same f and extras (but possibly different xs).
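
Schematically, the intended pattern looks something like the following sketch (the array size and loop count are just for illustration):

using DifferentiationInterface
import Tapir

f(x) = sqrt(sum(abs2, x) * sum(abs2, x))
backend = AutoTapir()

# Pay the preparation (rule-building) cost once per (f, backend) pair.
extras = prepare_gradient(f, backend, rand(10000))

# Then reuse the same f and extras for as many inputs as you like.
for _ in 1:1000
    x = rand(10000)
    y, g = value_and_gradient(f, backend, x, extras)
    # ... use y and g ...
end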

Unfortunately, on my laptop I still see huge compilation time when I do that, so I would guess it's up to Tapir then

CarloLucibello commented 4 months ago

I think a fair comparison should include the preparation time as well.

gdalle commented 4 months ago

Same debate as whether or not we include compilation time when benchmarking Julia against other languages 🤷

With DifferentiationInterfaceTest's benchmarking utilities, everything is measured, so you can decide for yourself. Here I'm assuming that we want to compute many gradients. If that's a debatable assumption, then we should indeed adapt the benchmark.

gdalle commented 4 months ago

Unfortunately, on my laptop I still see huge compilation time when I do that, so I would guess it's up to Tapir then

I think it has something to do with the sum operator. Take a look at this:

julia> function mysum(x)
           s = zero(eltype(x))
           for i in eachindex(x)
               s += x[i]
           end
           return s
       end
mysum (generic function with 1 method)

julia> @b (
           x=rand(10000),
           extras=prepare_gradient(sum, AutoTapir(), rand(10000)) 
       ) value_and_gradient(sum, AutoTapir(), _.x, _.extras)
597.588 ms (1188425 allocs: 75.311 MiB, 2.54% gc time, 68.11% compile time)

julia> @b (
           x=rand(10000),
           extras=prepare_gradient(mysum, AutoTapir(), rand(10000)) 
       ) value_and_gradient(mysum, AutoTapir(), _.x, _.extras)
1.031 ms (19 allocs: 1.110 MiB)
VarLad commented 4 months ago

I can reproduce the same. There's something wrong with the sum operator and Tapir, and it shows up both in the code with preparation and without. Although, even with the provided example, Enzyme's results show that there's a lot of performance left on the table :eyes:

[screenshot: benchmark results]

willtebbutt commented 4 months ago

I think a fair comparison should include the preparation time as well.

@CarloLucibello I don't think we need to frame this in terms of fairness, just in terms of what one is interested in. I personally am only really interested (at present) in optimising Tapir for runtime performance, subject to the constraint that prep + compile time shouldn't be unacceptably large (for some vague qualitative notion of acceptable, e.g. it shouldn't take an hour for your code to compile). Consequently (at present) I only really care about benchmarks that exclude prep + compile time, and about reports of "silly" prep + compile times. I of course understand that you may be interested in different use cases than I am, and therefore be more interested in prep + compile time, and that's completely fine.

Unfortunately, on my laptop I still see huge compilation time when I do that, so I would guess it's up to Tapir then

I can reproduce the same. There's something wrong with the sum operator and Tapir.

@gdalle @VarLad I don't believe these assessments are correct -- it appears to have to do with how Chairmarks treats extras:

]add Chairmarks DifferentiationInterface Tapir Zygote
using Chairmarks, DifferentiationInterface
import Tapir, Zygote

f(x) = sqrt(sum(abs2, x) * sum(abs2, x))

@b (
    x=rand(10000),
    extras=prepare_gradient(f, AutoTapir(), rand(10000)),
) value_and_gradient(f, AutoTapir(), _.x, _.extras) evals=1

# This one still seems to include setup time -- it looks like the time to compute extras is included.
# 511.842 ms (2207913 allocs: 140.952 MiB, 2.73% gc time, 67.73% compile time)

extras = prepare_gradient(f, AutoTapir(), rand(10000))
@b (
    x=rand(10000),
    extras=extras,
) value_and_gradient(f, AutoTapir(), _.x, _.extras) evals=1

# Extras computation is excluded
# 205.653 μs (2 allocs: 78.172 KiB)

Tapir is still the worst of the bunch, but we don't have a rule for sum -- we just differentiate through the optimised IR. Enzyme definitely generates better code in this instance than Tapir currently does, and Zygote is better than either Enzyme or Tapir here because it has a rule specifically for sum.
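
To see why a dedicated rule is so cheap here, a hand-written sketch of a reverse rule for sum over a real vector looks something like this (a schematic sketch only, not Zygote's or ChainRules' actual implementation, which handles more cases):

# The pullback just broadcasts the incoming cotangent, since d(sum(x))/dx[i] == 1 for every i.
function sum_value_and_pullback(x::AbstractVector{<:Real})
    y = sum(x)
    pullback(ȳ) = fill(ȳ, length(x))
    return y, pullback
end

y, pb = sum_value_and_pullback(rand(5))
grad = pb(1.0)   # == ones(5)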

Similarly, for mysum:

function mysum(x)
    s = zero(eltype(x))
    for i in eachindex(x)
        s += x[i]
    end
    return s
end

extras = prepare_gradient(mysum, AutoTapir(), rand(10000))

@b (
    x=rand(10000),
    extras=extras,
) value_and_gradient(mysum, AutoTapir(), _.x, _.extras)

# 74.769 μs (2 allocs: 78.172 KiB)

extras = prepare_gradient(mysum, AutoZygote(), rand(10000))

@b (
    x=rand(10000),
    extras=extras,
) value_and_gradient(mysum, AutoZygote(), _.x, _.extras)

# 391.253 ms (350098 allocs: 778.816 MiB, 45.83% gc time)

I would imagine that Enzyme will do really very well on this example.

gdalle commented 4 months ago

I'm very puzzled by this behavior of Chairmarks. Normally the setup code shouldn't be included in the measurements. Maybe @LilithHafner can help shed some light?

willtebbutt commented 4 months ago

Ohh I just had a look at what Chairmarks is doing via @macroexpand, and I now understand what's happening.

There's a difference between the performance of Tapir the first time you run it and the second time you run it. The first time you run it, a bunch of allocations happen (various stacks get allocated, etc.; there's also some actual compilation that gets interleaved with running code). Provided you have the same control flow on the second run, those allocations won't happen the second time (if control flow does change, additional allocations might have to happen).

The following ought to work:

]add Chairmarks DifferentiationInterface Tapir Zygote
using Chairmarks, DifferentiationInterface
import Tapir, Zygote

f(x) = sqrt(sum(abs2, x) * sum(abs2, x))

function prep_grad(f, backend::AutoTapir, x)
    extras = prepare_gradient(f, backend, x)
    # Run the operator once so that first-run allocations / compilation happen
    # here rather than inside the benchmark.
    value_and_gradient(f, backend, x, extras)
    return extras
end

@b (
    x=rand(10000),
    extras=prep_grad(f, AutoTapir(), rand(10000)),
) value_and_gradient(f, AutoTapir(), _.x, _.extras) evals=1

# 220.602 μs (2 allocs: 78.172 KiB)

So it is a Tapir.jl thing, but not quite what we thought. The runtime performance is good, just not great the first time you run it.

edit: this issue does raise the question of how best to communicate the correct way to benchmark Tapir performance. @gdalle I feel like we've had a few of these kinds of problems now, where Tapir is doing the right thing in a performant way, it's just that we've been using it incorrectly haha

gdalle commented 4 months ago

So it is a Tapir.jl thing, but not quite what we thought. The runtime performance is good, just not great the first time you run it.

Presumably the overhead on the first run _even after build_rrule_ is still big enough that Chairmarks doesn't bother running the code several times, and just returns that one run.

edit: this issue does raise the question of how best to communicate the correct way to benchmark Tapir performance.

Well, in the spirit of "prepare once, run a thousand times" (which is the one adopted by DI), I guess we should add one call to the operator inside the preparation workflow?
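
Concretely, the idea, sketched here against the public API (the real change would live inside DI's preparation code, not user code), would amount to something like:

# Sketch only: run the operator once during preparation so that first-run
# allocations and residual compilation are paid here rather than in later timings.
function prepare_and_warm_gradient(f, backend, x)
    extras = prepare_gradient(f, backend, x)
    value_and_gradient(f, backend, x, extras)
    return extras
end

This is essentially the prep_grad warm-up from the comment above, folded into the preparation step itself.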

willtebbutt commented 4 months ago

Well, in the spirit of "prepare once, run a thousand times" (which is the one adopted by DI), I guess we should add one call to the operator inside the preparation workflow?

I think this is probably the right thing to do -- if it causes problems for some reason, then we can always revert it at a later point in time.

gdalle commented 4 months ago

Can you help out with that one? https://github.com/gdalle/DifferentiationInterface.jl/pull/254

willtebbutt commented 4 months ago

I believe that if you follow @gdalle's original approach to benchmarking here, you should find that it works nicely now. If you could confirm that it indeed performs as expected for you, I'll close this issue.

gdalle commented 4 months ago

And if you use the main branch of DifferentiationInterface. A release is coming tomorrow or so.

gdalle commented 4 months ago

@willtebbutt it might be worth mentioning the preparation step in the readme when you discuss the DI bindings?

LilithHafner commented 4 months ago

Presumably the overhead on the first run even after build_rrule is still big enough that Chairmarks doesn't bother running the code several times, and just returns that one run.

This is correct. Chairmarks and BenchmarkTools both make the (in this case false) assumption that sample runtime is equal to evaluation runtime times the number of evaluations, and report the extrapolated evaluation runtime. I'll look into providing better results in this case.
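
A contrived, self-contained sketch of the effect (FakeExtras and grad_like are made up for illustration; the sleep stands in for first-use allocation and compilation):

using Chairmarks

mutable struct FakeExtras
    warmed::Bool
end
FakeExtras() = FakeExtras(false)

function grad_like(x, extras::FakeExtras)
    if !extras.warmed
        sleep(0.1)   # one-time cost on the first use of this particular extras
        extras.warmed = true
    end
    return sum(x)
end

# Fresh extras in the setup: every measured evaluation pays the one-time cost,
# so the reported time is dominated by it.
@b (x=rand(10000), extras=FakeExtras()) grad_like(_.x, _.extras) evals=1

# Extras created and warmed once outside: steady-state time is what gets measured.
extras = FakeExtras()
grad_like(rand(10000), extras)
@b (x=rand(10000), extras=extras) grad_like(_.x, _.extras) evals=1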

willtebbutt commented 4 months ago

]add Chairmarks DifferentiationInterface Tapir
using Chairmarks, DifferentiationInterface
import Tapir

f(x) = sqrt(sum(abs2, x) * sum(abs2, x))

@b (
    x=rand(10000),
    extras=prepare_gradient(sum, AutoTapir(safe_mode=false), rand(10000))
) value_and_gradient(sum, AutoTapir(), _.x, _.extras)

I just ran the above and got

94.448 μs (2 allocs: 78.172 KiB)

which is what I would expect. I'm using versions:

  [0ca39b1e] Chairmarks v1.2.1
  [a0c0ee7d] DifferentiationInterface v0.4.0
  [07d77754] Tapir v0.2.12

Consequently I'm going to mark this as resolved. Thanks all for contributing to this -- it's been very useful to iron this problem out!