JuliaDiff / ForwardDiff.jl

Forward Mode Automatic Differentiation for Julia

make Chunk(::StaticArray) type stable by default #707

Closed ExpandingMan closed 1 month ago

ExpandingMan commented 1 month ago

The function Chunk(x) is manifestly type unstable. In many cases when the value is not directly used, the compiler is clever enough to elide this, but in some cases its mere presence can cause significant performance penalties.
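For illustration (a sketch, not part of the PR), here is a minimal way to see that instability: the size parameter of the returned Chunk{N}() depends on the runtime length of the array, so the concrete return type cannot be inferred.

```julia
# Sketch (not part of the PR): the chunk size N depends on length(x), a runtime
# value, so inference cannot pin down a concrete Chunk{N} and @inferred errors.
using ForwardDiff, Test

x = rand(20)
@test_throws ErrorException @inferred ForwardDiff.Chunk(x)
```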

In particular, I encountered this when trying to get DifferentiationInterface.jl to behave nicely with static arrays. That package calls GradientConfig (or JacobianConfig, etc.), and for whatever reason the compiler gets upset about the calls to Chunk, so the whole thing is wildly inefficient.

This PR attempts to address that issue by defining a StaticArray method for Chunk and setting the size to the array's length. It is not entirely clear to me that this is the desired behavior, but dualize already seems to do this. This also ignores the global chunk size limit setting, but again, dualize appears to already do this.
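The kind of method described here could look roughly like the sketch below (an illustration of the idea, not necessarily the PR's exact code): for a static array the length is encoded in the type, so the chunk size is a compile-time constant.

```julia
# Rough sketch of the idea (not necessarily the PR's exact code): for a
# StaticArray the length is known from the type, so the Chunk's size parameter
# is a compile-time constant and the call is inferable.
using StaticArrays
import ForwardDiff

ForwardDiff.Chunk(x::StaticArray) = ForwardDiff.Chunk{length(typeof(x))}()
```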

Note that I have explicitly confirmed that the @inferred tests I have added fail without this PR but succeed with it.

(btw I am getting a failure in one of the tests for allocations during seed! when I run the test suite on my end, but it looks totally unrelated to this so I'm not sure what's going on there. Maybe it'll pass in CI/CD, who knows.)

mcabbott commented 1 month ago

As you say, chunk size = length does seem to be the present behaviour, e.g. ForwardDiff.gradient(x -> (@show x[1]), @SArray rand(1,100)).

If you explicitly make a GradientConfig, it seems to obey DEFAULT_CHUNK_THRESHOLD = 12, but this is ignored once you pass it to gradient? That's odd:

julia> f1(x) = @show x[1];

julia> ForwardDiff.Chunk(@SMatrix rand(1,100))
ForwardDiff.Chunk{12}()

julia> ForwardDiff.GradientConfig(f1, @SMatrix rand(1,100)).seeds
(Partials(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), Partials(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), Partials(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), Partials(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), Partials(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), Partials(0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), Partials(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0), Partials(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0), Partials(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0), Partials(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0), Partials(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0), Partials(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0))

julia> ForwardDiff.gradient(f1, @SArray(rand(1,100)), ForwardDiff.GradientConfig(f1, @SArray rand(1,100)))
x[1] = Dual{ForwardDiff.Tag{typeof(f1), Float64}}(0.6424264074844214,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0)
1×100 SMatrix{1, 100, Float64, 100} with indices SOneTo(1)×SOneTo(100):
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

Using chunk size 100 might be sub-optimal. I would guess that whatever compiler limitations StaticArrays runs into for long arrays likely apply to long tuples of partials too, so this thing with 10^4 numbers might be awful? Haven't timed it and perhaps picking a better heuristic isn't this PR's problem anyway.
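A rough way to check that concern (a sketch; timings will vary by machine and Julia version) is to compare the static-array path against the same computation on a plain Vector:

```julia
# Sketch: compare the static-array path (one chunk of length 100) against the
# plain-Vector path (chunked at DEFAULT_CHUNK_THRESHOLD). Illustrative only.
using ForwardDiff, StaticArrays, BenchmarkTools

f(x) = sum(abs2, x)
sx = @SVector rand(100)
vx = collect(sx)

@btime ForwardDiff.gradient($f, $sx);   # static path, 100 partials per dual
@btime ForwardDiff.gradient($f, $vx);   # default chunked path
```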

ExpandingMan commented 1 month ago

I have attempted an alternate scheme based partly on @avik-pal 's suggestion. It seems to work when I test it locally, though it is certainly rather precarious and scary looking as it assumes inference can follow a bunch of logic without breaking. I think what I have here now would be considered non-breaking though.

For whatever it's worth, were I to do this from scratch, I'd probably split Chunk into two different objects. The first would be a plain struct Chunk; size::Int; end used for ordinary arrays, and the second would be like the current Chunk{N}, used with static arrays or when a user explicitly requests a chunk size. In the standard array case it does not seem justified to expect the type Chunk{N} to be inferrable, so I'm not sure what business it has having a type parameter.
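Concretely, the split could look something like this sketch (hypothetical names, not the actual ForwardDiff API):

```julia
# Hypothetical sketch of the proposed split (these names are not ForwardDiff API):
# a runtime-sized chunk for ordinary arrays, and a type-level chunk for cases
# where the size is known from the type or chosen explicitly by the user.
using StaticArrays

struct RuntimeChunk        # size is a value; return type is always the concrete RuntimeChunk
    size::Int
end

struct StaticChunk{N} end  # size lives in the type; inferable when N is known statically

pick_chunk(x::AbstractArray) = RuntimeChunk(min(length(x), 12))   # runtime heuristic
pick_chunk(x::StaticArray)   = StaticChunk{length(typeof(x))}()   # compile-time size
```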

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.59%. Comparing base (c310fb5) to head (a4b3e4e). Report is 7 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #707      +/-   ##
==========================================
+ Coverage   89.57%   89.59%   +0.01%
==========================================
  Files          11       11
  Lines         969      980      +11
==========================================
+ Hits          868      878      +10
- Misses        101      102       +1
```


KristofferC commented 1 month ago

For me, the following diff is enough:

-const CHUNKS = [Chunk{i}() for i in 1:DEFAULT_CHUNK_THRESHOLD]
- 
 function Chunk(input_length::Integer, threshold::Integer = DEFAULT_CHUNK_THRESHOLD)
     N = pickchunksize(input_length, threshold)
-    0 < N <= DEFAULT_CHUNK_THRESHOLD && return CHUNKS[N]
+    0 < N <= DEFAULT_CHUNK_THRESHOLD && return Chunk{N}()
     return Chunk{N}()
 end

julia> @inferred ForwardDiff.GradientConfig(f, sx);

The reason the const CHUNKS is there is historical: Julia used to be bad at optimizing a certain pattern (https://github.com/JuliaLang/julia/issues/29887), but I don't think it should be needed now.

This should be benchmarked again with this change though: https://github.com/JuliaDiff/ForwardDiff.jl/pull/373

KristofferC commented 1 month ago

Made an alternative PR at https://github.com/JuliaDiff/ForwardDiff.jl/pull/708

gdalle commented 1 month ago

FYI @ExpandingMan the idea of DifferentiationInterface is to prepare the operator once and then reuse the preparation several times. For gradient and Jacobian, the preparation phase does involve a type-unstable call to Chunk, but then the resulting extras object allows for type-stable application of the operator. In this spirit, it's okay for preparation to be type unstable (and currently it has to be with ForwardDiff), but the code you put in hot loops should be the prepared version of the operator. See the tutorial at https://gdalle.github.io/DifferentiationInterface.jl/DifferentiationInterface/stable/tutorial1/#Preparing-for-multiple-gradients for an example.
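For example, a sketch of that prepare-once / reuse pattern (the exact DifferentiationInterface argument order has changed across releases, so treat the calls below as illustrative and check the linked tutorial for the current API):

```julia
# Sketch of the prepare-once / reuse pattern described above. Assumption: the
# argument order matches the DifferentiationInterface release current at the
# time of this thread; consult the linked tutorial for the exact signatures.
using DifferentiationInterface, ForwardDiff, StaticArrays

backend = AutoForwardDiff()
f(x) = sum(abs2, x)
x = @SVector rand(10)

extras = prepare_gradient(f, backend, x)   # may be type-unstable; done once
for _ in 1:1_000
    gradient(f, backend, x, extras)        # prepared call, used in the hot loop
end
```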

ExpandingMan commented 1 month ago

Thanks @KristofferC, I have confirmed that #708 works, so that seems obviously preferable to this one. I will close this.