Closed meson800 closed 10 months ago
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/extended_jump_array.jl | 33 | 34 | 97.06% |
| ext/JumpProcessFastBroadcastExt.jl | 4 | 8 | 50.0% |
| **Total:** | 37 | 42 | 88.1% |

| Totals | |
|---|---|
| Change from base Build 5658974660: | 1.4% |
| Covered Lines: | 2269 |
| Relevant Lines: | 2525 |
This looks great! Just a few minor pieces mentioned above and I think this is good to go. Comments:
> The new broadcasting method means that if you define an output array that is inhomogeneous (e.g. one where `A.u` is a `Vector{Float64}` but `A.jump_u` is a `Vector{Int64}`), then broadcasted operations will successfully maintain the inhomogeneity. However, if you allow the broadcast machinery to create an output array for you (allocating), it will just allocate a homogeneous `ExtendedJumpArray` with both `u` and `jump_u` being `Vector{Float64}`s. This is the same behavior as the fallback, so this should not break any existing downstream code.
👍 that should be fine.
> Finally, I just used an implementation for selecting an output `ExtendedJumpArray` while broadcasting based on the Julia interface documentation; this is a recursive function that is used to unpack the broadcasted tuple in a type-stable way, but there can be a few allocations that happen prior to the broadcast if someone broadcasts a very complicated expression that exceeds a recursion depth of 16. I was trying to replace it with a concise generated function that would always be type stable, but I think it is far less likely to hit this limit than in my other recent PRs. If there is an instance where this happens, I can fix this.
This is something that FastBroadcast.jl handles, so in practice we're fine.
Would be interesting to redo the VariableRate benchmarks once this is merged and see the performance vs. Coevolve now.
> Just a few minor pieces mentioned above and I think this is good to go
I added a commit to address those comments, thanks!
> This is something that FastBroadcast.jl handles, so in practice we're fine.
I think the problem could still exist, as FastBroadcast also calls out to `similar`. The `ExtendedJumpArrayStyle` method of `similar` is where `find_eja` is used (in order to locate the `ExtendedJumpArray` in the arg list to determine output axis sizes); if the argument list is too deep then the compiler might give up on type inference for the recursive call, leading to some inefficient boxing/unboxing.
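As a hypothetical sketch of that recursive tuple search (the `FakeEJA` type and `find_fake_eja` names below are illustrative stand-ins, not the actual JumpProcesses implementation of `find_eja`):

```julia
# Toy stand-in for ExtendedJumpArray; illustrative only.
struct FakeEJA
    u::Vector{Float64}
    jump_u::Vector{Float64}
end

# Recursively walk the broadcast argument tuple until the first FakeEJA
# is found. Dispatch unrolls this in a type-stable way for shallow
# tuples, but inference can give up past its recursion limit, leading
# to the boxing/unboxing mentioned above.
find_fake_eja(args::Tuple) = _find(first(args), Base.tail(args))
_find(x::FakeEJA, rest) = x          # found it, stop recursing
_find(x, rest) = find_fake_eja(rest) # keep walking the tuple

args = (1.0, 2, FakeEJA([1.0], [2.0]), 3.0)
@assert find_fake_eja(args) isa FakeEJA
```

In the PR, the analogous function is what `similar` uses to recover the output axis sizes from the broadcast arguments.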
> Would be interesting to redo the VariableRate benchmarks once this is merged and see the performance vs. Coevolve now.
I expect that the broadcasting speedup is potentially less for short `ExtendedJumpArray`s because they benefit less from SIMD.
@meson800 could you see if this issue is also closed now:
https://github.com/SciML/JumpProcesses.jl/issues/321
If so can you add a related test using the simple example in that issue in your extended jump array tests file? (I only ask as it should be super easy to check.)
Thanks!
Yes, #321 looks like it is fixed now by the broadcasting changes. Tests added and pushed. It possibly won't be fixed on pre-1.9 Julia, though, because the FastBroadcast extension won't be loaded.
We'll see when the 1.6 CI finishes, but I think that this is still fine on < Julia 1.9. It won't load the extension so the FastBroadcasting will be slower, but the broadcast rules should still load and correctly select the output type, fixing the problem in #321.
Yes, it should just fall back to being slower, which is fine for earlier versions. We need to just move forward and get the new LTS 😅
@meson800 thanks for this! I suspect you've actually closed a bunch of other open `ExtendedJumpArray`-related issues with this.
Summary
The previous `ExtendedJumpArray` code contained a partial implementation of the broadcastable interface, but it was not actually used because the full interface was never included. This commit replaces that broadcast code, reusing much of the arg packing/repacking. Broadcasting operations that rely only on scalars and `ExtendedJumpArray`s will now use an efficient implementation that can emit SIMD instructions at the native and LLVM code levels. This closes #335. Benchmarking shows a two- to five-fold increase in performance with this broadcasting method.

The previous code used the fallback indexing method when broadcasting, which requires branching inside the broadcast loop. This branch is likely well predicted, but its presence prevents kernel functions from being fused into efficient SIMD instructions.
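To make the branching point concrete, here is a minimal sketch (with a toy `TwoPart` type, not the real `ExtendedJumpArray`) of why scalar indexing into a two-part array puts a branch inside every broadcast iteration:

```julia
# Toy two-part array; illustrative only.
struct TwoPart
    u::Vector{Float64}
    jump_u::Vector{Float64}
end
Base.length(A::TwoPart) = length(A.u) + length(A.jump_u)

function Base.getindex(A::TwoPart, i::Int)
    # This branch runs on every element access in a fallback broadcast
    # loop, which is what blocks SIMD fusion.
    i <= length(A.u) ? A.u[i] : A.jump_u[i - length(A.u)]
end

A = TwoPart([1.0, 2.0], [3.0])
@assert A[2] == 2.0 && A[3] == 3.0
```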
Implementation
This PR defines an `ExtendedJumpArrayStyle`, a broadcast style that includes two sub-styles representing the styles for broadcasting the `u` and `jump_u` entries. `copyto!` overloads are specified that do the efficient thing and call `copyto!`, using the broadcast computation tree, on `u` and `jump_u` separately. Binary style rules are defined such that broadcasts fall back to the old method if you do something like adding an `ExtendedJumpArray` to a normal vector. The arg tuple unpacking is mostly the same as the old partial implementation that was there; I just renamed the functions and unified them with the new broadcasting style.
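As a rough sketch of the split-`copyto!` idea (using a toy `MiniEJA` type and a fixed kernel rather than the real broadcast-tree machinery):

```julia
# Toy container; illustrative only.
struct MiniEJA{T}
    u::Vector{T}
    jump_u::Vector{T}
end

# Broadcast each half separately: both inner loops are plain Vector
# broadcasts, so they stay branch-free and SIMD-friendly.
function split_axpy!(dest::MiniEJA, a, x::MiniEJA, y::MiniEJA)
    dest.u .= a .* x.u .+ y.u
    dest.jump_u .= a .* x.jump_u .+ y.jump_u
    return dest
end

x = MiniEJA([1.0, 2.0], [3.0])
y = MiniEJA([0.5, 0.5], [0.5])
d = MiniEJA(zeros(2), zeros(1))
split_axpy!(d, 2.0, x, y)
@assert d.u == [2.5, 4.5] && d.jump_u == [6.5]
```

The actual PR generalizes this by walking the broadcast computation tree rather than hard-coding a kernel.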
A Julia >= 1.9 extension is added for FastBroadcast.jl, where the equivalent of the `copyto!` call, `fast_materialize!`, is similarly overloaded. This only adds a weak dependency, so only users who also use `FastBroadcast` (such as `OrdinaryDiffEq`) will have these methods loaded; those that don't already have FastBroadcast in the environment will not pull it in as a dependency.

Tests are added to check that the new broadcasting behavior is behaving as expected. In addition, a benchmark file is included that can be used to confirm that `ExtendedJumpArray` broadcasting is just as fast as broadcasting of normal `Vector{Float64}`s. This PR passes all other tests, including e.g. the ones that use auto differentiation; broadcasting works great even in this case :)

Remaining problems/notes
Inhomogeneous `ExtendedJumpArray`s still act strangely under broadcasting, but no more strangely than they were already acting. In fact, some of the VariableRateJump tests rely on the behavior that an `ExtendedJumpArray` plus a `Vector` decays to become a `Vector`.

The new broadcasting method means that if you define an output array that is inhomogeneous (e.g. one where `A.u` is a `Vector{Float64}` but `A.jump_u` is a `Vector{Int64}`), then broadcasted operations will successfully maintain the inhomogeneity. However, if you allow the broadcast machinery to create an output array for you (allocating), it will just allocate a homogeneous `ExtendedJumpArray` with both `u` and `jump_u` being `Vector{Float64}`s. This is the same behavior as the fallback, so this should not break any existing downstream code.

Finally, I just used an implementation for selecting an output `ExtendedJumpArray` while broadcasting based on the Julia interface documentation; this is a recursive function that is used to unpack the broadcasted tuple in a type-stable way, but there can be a few allocations that happen prior to the broadcast if someone broadcasts a very complicated expression that exceeds a recursion depth of 16. I was trying to replace it with a concise generated function that would always be type stable, but I think it is far less likely to hit this limit than in my other recent PRs. If there is an instance where this happens, I can fix this.
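The eltype promotion described above for the allocating path mirrors Base broadcasting; a small sketch with plain `Vector`s standing in for the `u`/`jump_u` halves (no JumpProcesses types involved):

```julia
jump_u = [3, 4]           # stands in for an Int64 jump_u half

jump_u .= jump_u .+ 1     # in-place broadcast keeps the Int64 eltype
@assert eltype(jump_u) == Int && jump_u == [4, 5]

out = jump_u .+ 0.5       # allocating broadcast promotes to Float64
@assert eltype(out) == Float64 && out == [4.5, 5.5]
```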