MilesCranmer opened 1 year ago
If you only care about operators like `+`, `-`, `*`, `/`, `cos`, and `exp`, you could instead use ForwardDiff following the approach I used in SimpleChains.jl.
Because `LoopVectorization.@turbo` works on `Expr` at a high level, it is only able to understand primitive operations, so it will naively assume that operations it doesn't recognize are expensive. It can use the type of an operation to figure out what it is, but it won't introspect a more complicated function.
So it would be possible to do better than the `DualEval` example above.
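The SimpleChains-style idea can be sketched roughly like this (a hypothetical illustration, not code from the thread: the function name `cos_and_deriv!` is made up, and whether `@turbo` accepts `Dual` element types depends on LoopVectorization's ForwardDiff integration; if it rejects the loop, the same pattern works with `@inbounds @simd`):

```julia
using LoopVectorization
using ForwardDiff: Dual, value, partials

# Construct the Dual numbers by hand so the loop body only contains cos
# itself, a primitive the vectorizer knows, rather than an opaque
# ForwardDiff.derivative closure it would treat as expensive.
function cos_and_deriv!(y, dy, x)
    @turbo for i in eachindex(x)
        d = cos(Dual(x[i], one(eltype(x))))
        y[i] = value(d)         # cos(x[i])
        dy[i] = partials(d, 1)  # -sin(x[i])
    end
    return y, dy
end
```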
Note that ChainRules are defined on `LoopVectorization.vmap`.
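For instance, `vmap` is a vectorized drop-in for elementwise broadcasting, and because ChainRules are defined for it, ChainRules-consuming AD packages can differentiate through it (a minimal sketch of the forward call only):

```julia
using LoopVectorization: vmap

# Vectorized elementwise application, equivalent in result to cos.(x).
x = randn(1000)
y = vmap(cos, x)
```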
I suggest poking around with Cthulhu to make sure there aren't any accidental type instabilities. Turning optimizations on will also tell you what is and isn't inlined.
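Cthulhu's `@descend` is interactive; as a quick non-interactive first check, you can also inspect the inferred return type (a sketch using a plain-loop stand-in for the kernel, since the check itself is what matters here):

```julia
# Plain-loop stand-in for an elementwise kernel, just to illustrate the check.
function apply_op(x, op::F) where {F}
    y = similar(x)
    @inbounds for i in eachindex(x)
        y[i] = op(x[i])
    end
    return y
end

# A single, concrete inferred return type means inference succeeded
# at this level; a Union or abstract type would signal an instability.
rts = Base.return_types(apply_op, (Vector{Float64}, typeof(cos)))
@assert only(rts) === Vector{Float64}
```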
Thanks for the quick and detailed response!
I've been trying my best to understand the code you sent, but it is a bit outside of my wheelhouse and many of these calls are new to me. `ForwardDiff.Dual` doesn't seem to have docs available. From what I can guess, the code is trying to expand `ForwardDiff.derivative` within `@turbo` so that it sees the exact operators before it tries to vectorize them?
Here's a MWE I have:

```julia
using LoopVectorization  # for @turbo

function f(x, op::F, turbo) where {F}
    y = similar(x)
    if turbo
        @turbo for i in eachindex(x)
            y[i] = op(x[i])
        end
    else
        @inbounds @simd for i in eachindex(x)
            y[i] = op(x[i])
        end
    end
    return y
end
```
We can see that `turbo=true` improves this significantly if I just pass in `cos`:
```julia
julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=true; op=cos);
  19.250 μs (2 allocations: 78.17 KiB)

julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=false; op=cos);
  117.708 μs (2 allocations: 78.17 KiB)
```
But when I pass in the `ForwardDiff.derivative` of `cos`, the performance gains are not as good, despite the fact that `op` should effectively just be `-sin`:
```julia
julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=true; op=Base.Fix1(ForwardDiff.derivative, cos));
  57.083 μs (2 allocations: 78.17 KiB)

julia> @btime f(x, op, turbo) setup=(x=randn(10000); turbo=false; op=Base.Fix1(ForwardDiff.derivative, cos));
  143.625 μs (2 allocations: 78.17 KiB)
```
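Since the operator set here is tiny, one possible workaround (not from the thread, just an illustration) is to map each operator to a hand-written derivative built from primitives the vectorizer recognizes, so no `ForwardDiff` closure appears inside the loop at all; the `deriv` function below is a hypothetical name:

```julia
# Hand-written derivatives for a small, known operator set. Each returns
# a plain closure composed only of primitives @turbo understands.
deriv(::typeof(cos)) = x -> -sin(x)
deriv(::typeof(exp)) = x -> exp(x)
deriv(::typeof(identity)) = x -> one(x)

dcos = deriv(cos)
dcos(0.5) ≈ -sin(0.5)  # true
```

Such a closure could then be passed in place of `Base.Fix1(ForwardDiff.derivative, cos)`; each closure has its own concrete type, so the `op::F` signature still specializes on it.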
Maybe this is not possible and I should try Enzyme.jl instead?
I am trying to understand a performance issue I am seeing in DynamicExpressions.jl, where using `@turbo` makes the evaluation kernels 4x faster but makes the derivative kernels 10% slower. See the detailed benchmarks here: https://github.com/SymbolicML/DynamicExpressions.jl/pull/28#issuecomment-1529108964

My derivative kernels look like this:
(The `@maybe_turbo turbo ...` will turn into `@turbo ...` when `turbo=true`, but just `@inbounds @simd ...` otherwise. It will also remove the various type assertions in the scope.)

To create the `diff_op`, I generate it using Zygote.jl here:

I can try to create a MWE for this, but I quickly wanted to check if anything was obvious in how I am using `@turbo` here that might hurt performance rather than help it. For example, perhaps this `diff_op` is not being inlined correctly, and therefore not being optimized by `@turbo`? For the record, I am not seeing any warnings about the derivative operator being incompatible, so I'm not quite sure why this is occurring.

Also, the `diff_op` in the benchmark is the derivative of one of `+, -, *, /, cos, exp`, so nothing too crazy.