Closed meson800 closed 10 months ago
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/extended_jump_array.jl | 33 | 34 | 97.06% |
| ext/JumpProcessFastBroadcastExt.jl | 4 | 8 | 50.0% |
| **Total:** | 37 | 42 | 88.1% |

| Totals | |
|---|---|
| Change from base Build 5658974660: | 1.4% |
| Covered Lines: | 2269 |
| Relevant Lines: | 2525 |
This looks great! Just a few minor pieces mentioned above and I think this is good to go. Comments:
> The new broadcasting method means that if you define an output array that is inhomogeneous (e.g. one where `A.u` is a `Vector{Float64}` but `A.jump_u` is a `Vector{Int64}`), then broadcasted operations will successfully maintain the inhomogeneity. However, if you allow the broadcast machinery to create an output array for you (allocating), it will just allocate a homogeneous `ExtendedJumpArray` with both `u` and `jump_u` being `Vector{Float64}`s. This is the same behavior as the fallback, so this should not break any existing downstream code.
👍 that should be fine.
> Finally, I just used an implementation for selecting an output `ExtendedJumpArray` while broadcasting based on the Julia interface documentation; this is a recursive function that is used to unpack the broadcasted tuple in a type-stable way, but there can be a few allocations that happen prior to the broadcast if someone broadcasts a very complicated expression that exceeds a recursion depth of 16. I was trying to replace it with a concise generated function that would always be type stable, but I think it is far less likely to hit this limit than in my other recent PRs. If there is an instance where this happens, I can fix this.
This is something that FastBroadcast.jl handles, so in practice we're fine.
Would be interesting to redo the VariableRate benchmarks once this is merged and see the performance vs. Coevolve now.
> Just a few minor pieces mentioned above and I think this is good to go
I added a commit to address those comments, thanks!
> This is something that FastBroadcast.jl handles, so in practice we're fine.
I think the problem could still exist, as FastBroadcast also calls out to `similar`. The `ExtendedJumpArrayStyle` method of `similar` is where `find_eja` is used (in order to locate the `ExtendedJumpArray` in the arg list to determine output axis sizes); if the argument list is too deep then the compiler might give up on type inference for the recursive call, leading to some inefficient boxing/unboxing.
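As a hypothetical sketch of that recursive tuple search (the `FakeEJA` type and `find_fake_eja` names below are illustrative stand-ins, not the actual JumpProcesses implementation of `find_eja`):

```julia
# Toy stand-in for ExtendedJumpArray; illustrative only.
struct FakeEJA
    u::Vector{Float64}
    jump_u::Vector{Float64}
end

# Recursively walk the broadcast argument tuple until the first FakeEJA
# is found. Dispatch unrolls this in a type-stable way for shallow
# tuples, but inference can give up past its recursion limit, leading
# to the boxing/unboxing mentioned above.
find_fake_eja(args::Tuple) = _find(first(args), Base.tail(args))
_find(x::FakeEJA, rest) = x          # found it, stop recursing
_find(x, rest) = find_fake_eja(rest) # keep walking the tuple

args = (1.0, 2, FakeEJA([1.0], [2.0]), 3.0)
@assert find_fake_eja(args) isa FakeEJA
```

In the PR, the analogous function is what `similar` uses to recover the output axis sizes from the broadcast arguments.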
> Would be interesting to redo the VariableRate benchmarks once this is merged and see the performance vs. Coevolve now.
I expect that the broadcasting speedup is potentially less for short `ExtendedJumpArray`s because they benefit less from SIMD.
@meson800 could you see if this issue is also closed now:
https://github.com/SciML/JumpProcesses.jl/issues/321
If so can you add a related test using the simple example in that issue in your extended jump array tests file? (I only ask as it should be super easy to check.)
Thanks!
Yes, #321 looks like it is fixed now by the broadcasting changes. Tests added and pushed. It possibly won't be fixed on pre-1.9 Julia, though, because the FastBroadcast extension won't be loaded.
We'll see when the 1.6 CI finishes, but I think that this is still fine on < Julia 1.9. It won't load the extension so the FastBroadcasting will be slower, but the broadcast rules should still load and correctly select the output type, fixing the problem in #321.
Yes, it should just fall back to being slower, which is fine for earlier versions. We need to just move forward and get the new LTS 😅
@meson800 thanks for this! I suspect you've actually closed a bunch of other open `ExtendedJumpArray`-related issues with this.
Summary
The previous `ExtendedJumpArray` code contained a partial implementation of the broadcastable interface, but it was not actually used because the full interface was never included. This commit replaces that broadcast code, reusing much of the arg packing/repacking. Broadcasting operations that rely only on scalars and `ExtendedJumpArray`s will now use an efficient implementation that can emit SIMD instructions at the native and LLVM code levels. This closes #335. Benchmarking shows a two- to five-fold increase in performance with this broadcasting method.

The previous code used the fallback indexing method when broadcasting, which requires branching inside the broadcast loop. This branch is likely well predicted, but its presence prevents kernel functions from being fused into efficient SIMD instructions.
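To make the branching point concrete, here is a minimal sketch (with a toy `TwoPart` type, not the real `ExtendedJumpArray`) of why scalar indexing into a two-part array puts a branch inside every broadcast iteration:

```julia
# Toy two-part array; illustrative only.
struct TwoPart
    u::Vector{Float64}
    jump_u::Vector{Float64}
end
Base.length(A::TwoPart) = length(A.u) + length(A.jump_u)

function Base.getindex(A::TwoPart, i::Int)
    # This branch runs on every element access in a fallback broadcast
    # loop, which is what blocks SIMD fusion.
    i <= length(A.u) ? A.u[i] : A.jump_u[i - length(A.u)]
end

A = TwoPart([1.0, 2.0], [3.0])
@assert A[2] == 2.0 && A[3] == 3.0
```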
Implementation
This PR defines an `ExtendedJumpArrayStyle`, a broadcast style that includes two sub-styles representing the styles for broadcasting the `u` and `jump_u` entries. `copyto!` overloads are specified that do the efficient thing and call `copyto!`, using the broadcast computation tree, on `u` and `jump_u` separately. Binary style rules are defined such that broadcasts fall back to the old method if you do something like adding an `ExtendedJumpArray` to a normal vector. The arg tuple unpacking is mostly the same as the old partial implementation that was there; I just renamed the functions and unified them with the new broadcasting style.
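As a rough sketch of the split-`copyto!` idea (using a toy `MiniEJA` type and a fixed kernel rather than the real broadcast-tree machinery):

```julia
# Toy container; illustrative only.
struct MiniEJA{T}
    u::Vector{T}
    jump_u::Vector{T}
end

# Broadcast each half separately: both inner loops are plain Vector
# broadcasts, so they stay branch-free and SIMD-friendly.
function split_axpy!(dest::MiniEJA, a, x::MiniEJA, y::MiniEJA)
    dest.u .= a .* x.u .+ y.u
    dest.jump_u .= a .* x.jump_u .+ y.jump_u
    return dest
end

x = MiniEJA([1.0, 2.0], [3.0])
y = MiniEJA([0.5, 0.5], [0.5])
d = MiniEJA(zeros(2), zeros(1))
split_axpy!(d, 2.0, x, y)
@assert d.u == [2.5, 4.5] && d.jump_u == [6.5]
```

The actual PR generalizes this by walking the broadcast computation tree rather than hard-coding a kernel.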
A Julia >= 1.9 extension is added for FastBroadcast.jl, where the equivalent of the `copyto!` call, `fast_materialize!`, is similarly overloaded. This only adds a weak dependency, so only users who also use `FastBroadcast` (such as `OrdinaryDiffEq`) will have these methods loaded; those that don't already have FastBroadcast in the environment will not pull it in as a dependency.

Tests are added to check that the new broadcasting behavior is behaving as expected. In addition, a benchmark file is included that can be used to confirm that `ExtendedJumpArray` broadcasting is just as fast as broadcasting of normal `Vector{Float64}`s. This PR passes all other tests, including e.g. the ones that use auto differentiation; broadcasting works great even in this case :)

Remaining problems/notes
Inhomogeneous `ExtendedJumpArray`s still act strangely under broadcasting, but no more strangely than they were already acting. In fact, some of the VariableRateJump tests rely on the behavior that an `ExtendedJumpArray` plus a `Vector` decays to become a `Vector`.

The new broadcasting method means that if you define an output array that is inhomogeneous (e.g. one where `A.u` is a `Vector{Float64}` but `A.jump_u` is a `Vector{Int64}`), then broadcasted operations will successfully maintain the inhomogeneity. However, if you allow the broadcast machinery to create an output array for you (allocating), it will just allocate a homogeneous `ExtendedJumpArray` with both `u` and `jump_u` being `Vector{Float64}`s. This is the same behavior as the fallback, so this should not break any existing downstream code.

Finally, I just used an implementation for selecting an output `ExtendedJumpArray` while broadcasting based on the Julia interface documentation; this is a recursive function that is used to unpack the broadcasted tuple in a type-stable way, but there can be a few allocations that happen prior to the broadcast if someone broadcasts a very complicated expression that exceeds a recursion depth of 16. I was trying to replace it with a concise generated function that would always be type stable, but I think it is far less likely to hit this limit than in my other recent PRs. If there is an instance where this happens, I can fix this.
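The eltype promotion described above for the allocating path mirrors Base broadcasting; a small sketch with plain `Vector`s standing in for the `u`/`jump_u` halves (no JumpProcesses types involved):

```julia
jump_u = [3, 4]           # stands in for an Int64 jump_u half

jump_u .= jump_u .+ 1     # in-place broadcast keeps the Int64 eltype
@assert eltype(jump_u) == Int && jump_u == [4, 5]

out = jump_u .+ 0.5       # allocating broadcast promotes to Float64
@assert eltype(out) == Float64 && out == [4.5, 5.5]
```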