JuliaSIMD / LoopVectorization.jl

Macro(s) for vectorizing loops.
MIT License

Performance issue with Zygote.jl-generated function within `@turbo` #489

Open MilesCranmer opened 1 year ago

MilesCranmer commented 1 year ago

I am trying to understand a performance issue I am seeing in DynamicExpressions.jl where using @turbo makes the evaluation kernels 4x faster, but makes the derivative kernels 10% slower. See the detailed benchmarks here: https://github.com/SymbolicML/DynamicExpressions.jl/pull/28#issuecomment-1529108964

My derivative kernels look like this:

@maybe_turbo turbo for j in indices((cumulator, dcumulator))
    x = op(cumulator[j])::T
    dx = diff_op(cumulator[j])::T * dcumulator[j]

    cumulator[j] = x
    dcumulator[j] = dx
end

(The @maybe_turbo turbo ... expands to @turbo ... when turbo=true, and to @inbounds @simd ... otherwise. It also strips the various type assertions from the loop body.)
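
Roughly, such a macro could look like the following (a simplified sketch, not the actual DynamicExpressions.jl implementation, which additionally removes the ::T assertions before handing the loop to @turbo):

using LoopVectorization

# Simplified sketch of a @maybe_turbo-style macro: the loop expression is
# emitted twice and the `turbo` flag is branched on at runtime.
macro maybe_turbo(turbo, loop)
    esc(quote
        if $turbo
            LoopVectorization.@turbo $loop
        else
            @inbounds @simd $loop
        end
    end)
end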

To create the diff_op, I generate it using Zygote.jl here:

using Zygote: gradient

for op in unary_operators
    # gradient returns a tuple of gradients, one per argument; [1] extracts d(op)/dx
    diff_op(x) = gradient(op, x)[1]
    push!(diff_unary_operators, diff_op)
end
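
For example, for op = cos the generated closure evaluates pointwise to -sin (a quick sanity check, assuming Zygote's gradient is in scope as above):

julia> diff_cos(x) = gradient(cos, x)[1];

julia> diff_cos(0.3) ≈ -sin(0.3)
true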

I can try to create an MWE for this, but I wanted to quickly check whether anything is obviously wrong in how I am using @turbo here that might hurt performance rather than help it. For example, perhaps this diff_op is not being inlined correctly, and therefore not being optimized by @turbo? For the record, I am not seeing any warnings about the derivative operator being incompatible, so I'm not sure why this is occurring.

Also, the diff_op in the benchmark is the derivative of one of +, -, *, /, cos, or exp, so nothing too crazy.

chriselrod commented 1 year ago

If you only care about operators like +, -, *, /, cos, and exp, you could instead use ForwardDiff, following the approach I used in SimpleChains.jl.

Because LoopVectorization.@turbo works on the Expr at a high level, it is only able to understand primitive operations, so it will naively assume that any operation it doesn't recognize is expensive. It can use an operator's type to figure out what it is, but it won't introspect a more complicated function. So it would be possible to do better than the DualEval example above.
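
As a rough illustration of the difference (a hypothetical example, not the SimpleChains.jl code): a call like cos that appears literally in the loop body is recognized and rewritten by @turbo, whereas the Zygote-generated closure is opaque to it:

using LoopVectorization

x = randn(1000); y = similar(x); dy = similar(x)
diff_op(z) = -sin(z)  # stands in for the Zygote-generated closure

# cos appears literally in the Expr, so @turbo swaps in its own vectorized
# implementation and can cost-model it accurately:
@turbo for i in eachindex(x)
    y[i] = cos(x[i])
end

# diff_op is opaque to the macro; it is still called on SIMD vectors, but
# @turbo has to assume it is an unknown, expensive operation:
@turbo for i in eachindex(x)
    dy[i] = diff_op(x[i])
end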

Note that ChainRules rules are defined for LoopVectorization.vmap.
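
For example (a minimal sketch, using the single-argument cos case from above), the whole map could go through vmap and be differentiated by Zygote's pullback machinery instead of a hand-written @turbo loop:

using LoopVectorization, Zygote

x = randn(1000)

# vmap is LoopVectorization's vectorized map; since ChainRules rules exist for
# it, Zygote can differentiate through it directly.
y, pb = Zygote.pullback(v -> vmap(cos, v), x)
dx = pb(ones(length(x)))[1]  # dx[i] == -sin(x[i])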

I suggest poking around with Cthulhu to make sure there aren't any accidental type instabilities. Turning optimizations on in Cthulhu will also show you what is and isn't inlined.
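
For instance (a hypothetical invocation, assuming Cthulhu.jl and the diff_op generated above), one could descend into a call to the derivative operator:

using Cthulhu

# Descend into the generated operator; pressing `o` in the interactive view
# toggles optimized (post-inlining) IR, making non-inlined calls and
# abstractly-typed values visible.
@descend diff_op(1.0)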

MilesCranmer commented 1 year ago

Thanks for the quick and detailed response!

I've been trying my best to understand the code you sent, but it is a bit outside of my wheelhouse and many of these calls are new to me. ForwardDiff.Dual doesn't seem to have docs available. From what I can guess, the code is trying to expand ForwardDiff.derivative within @turbo so that it sees the exact operators before it tries to vectorize them?

Here's an MWE I have:

# BenchmarkTools and ForwardDiff are used in the benchmarks below
using BenchmarkTools, ForwardDiff, LoopVectorization

function f(x, op::F, turbo) where {F}
    y = similar(x)
    if turbo
        @turbo for i in eachindex(x)
            y[i] = op(x[i])
        end
    else
        @inbounds @simd for i in eachindex(x)
            y[i] = op(x[i])
        end
    end
    return y
end

We can see that turbo=true improves this significantly if I just pass in cos:

julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=true; op=cos);
  19.250 μs (2 allocations: 78.17 KiB)

julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=false; op=cos);
  117.708 μs (2 allocations: 78.17 KiB)

But when I pass in the ForwardDiff.derivative of cos, the performance gains are not as good, despite the fact that op should effectively just be -sin:

julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=true; op=Base.Fix1(ForwardDiff.derivative, cos));
  57.083 μs (2 allocations: 78.17 KiB)

julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=false; op=Base.Fix1(ForwardDiff.derivative, cos));
  143.625 μs (2 allocations: 78.17 KiB)
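
For comparison, one could also benchmark a hand-written derivative directly (just a sketch, no timings run here), to separate the cost of the ForwardDiff machinery from the cost of the opaque closure itself:

julia> neg_sin(z) = -sin(z);

julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=true; op=neg_sin);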

Maybe this is not possible and I should try Enzyme.jl instead?