Closed tkoolen closed 7 years ago
Very interesting!
Are those code snippets for the constructors? I wonder what @julia_Type_71534 is? Can you track that down? If it is the constructor RotMatrix, then we might actually need to create explicitly inlined inner constructors... but I'm not sure if that actually works or not, or indeed if that is the problem. The other constructor is for SMatrix.
Clearly it is doing a function call; in theory we should be able to do all of this in one stack frame.
Does changing this definition here help?
immutable RotMatrix{N,T,L} <: Rotation{N,T} # which is <: AbstractMatrix{T}
    mat::SMatrix{N, N, T, L} # The final parameter to SMatrix is the "length" of the matrix, 3 × 3 = 9
    @inline RotMatrix(x::SMatrix{N,N,T,L}) = new(x)
end
PS - if you're interested, you might like to consider CoordinateTransformations for an alternative route to affine transformations (I didn't go through this level of benchmarking, however).
Yes, those LLVM code snippets were for the constructors, sorry for adding confusion with the * methods.
Adding the @inline constructor results in a stack overflow because it calls itself :-).
I just ran my benchmarks and the @code_llvm calls again, and now there is no difference between the two! I'm not sure what happened. Maybe a Pkg.update() changed things? I'm a little confused, but I'll close the issue in any case.
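For reference, here is a minimal sketch of an explicitly inlined inner constructor that does not call itself, written in current Julia syntax (struct, where, new{...}) rather than the 0.5-era syntax used above. Wrapped is a hypothetical stand-in with the same parameter layout as RotMatrix{N,T,L}; it stores a plain tuple instead of an SMatrix so that it runs without any packages, and it is not the Rotations.jl source.

```julia
# Hypothetical wrapper type, analogous in shape to RotMatrix{N,T,L}.
struct Wrapped{N,T,L}
    data::NTuple{L,T}   # L = N * N entries

    # Inner constructor: fully parameterised, so it calls `new{N,T,L}` directly
    # and never dispatches back to itself.
    @inline function Wrapped{N,T,L}(data::NTuple{L,T}) where {N,T,L}
        L == N * N || throw(ArgumentError("expected $(N * N) entries, got $L"))
        return new{N,T,L}(data)
    end
end

# Outer convenience constructor: infers T and L from the tuple and forwards to
# the inner constructor above, which is a different method (no recursion).
@inline Wrapped{N}(data::NTuple{L,T}) where {N,T,L} = Wrapped{N,T,L}(data)

w = Wrapped{2}((1.0, 0.0, 0.0, 1.0))
```

The point of the sketch is only that the inner method is defined on the fully parameterised type and the convenience method on the partially parameterised one, so neither can end up calling itself. Whether the @inline annotation changes the generated code would still have to be checked with @code_llvm.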
I was aware of CoordinateTransformations. I created my RigidBodyDynamics.Transform3D type before CoordinateTransformations existed, and when I switched from Quaternions.jl to Rotations.jl it was easier to just change the rot field of RigidBodyDynamics.Transform3D from Quaternions.Quaternion to Rotations.RotMatrix than it was to switch to CoordinateTransformations.Transformation. It's also one less dependency, and I don't currently need the additional functionality in CoordinateTransformations. Finally, I'm thinking about switching the underlying data of RigidBodyDynamics.Transform3D to a single SMatrix{4, 4} representing both rotation and translation as a homogeneous matrix, because it turns out that on modern processors with AVX instructions, multiplying two SMatrix{4, 4} matrices is actually significantly faster than 'exploiting the sparsity' by having separate rotation and translation fields (5.628 ns vs. 12.073 ns on my machine). The difference between these two options is negligible on machines without AVX instructions btw.
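To make the two layouts concrete, here is a small sketch showing that composing transforms with separate rotation and translation fields gives the same result as one 4×4 homogeneous matrix product. SplitTransform, compose, homogeneous and rotz are hypothetical names, and the sketch uses ordinary Base arrays rather than SMatrix, so it illustrates only the equivalence and says nothing about the timings quoted above.

```julia
using LinearAlgebra

# Rigid transform stored as separate rotation and translation (hypothetical type).
struct SplitTransform
    rot::Matrix{Float64}    # 3×3 rotation
    trans::Vector{Float64}  # length-3 translation
end

# Composition that exploits the structure: (R1, t1) ∘ (R2, t2) = (R1 R2, R1 t2 + t1).
compose(a::SplitTransform, b::SplitTransform) =
    SplitTransform(a.rot * b.rot, a.rot * b.trans + a.trans)

# Pack the same transform into one 4×4 homogeneous matrix [R t; 0 0 0 1].
homogeneous(x::SplitTransform) = vcat(hcat(x.rot, x.trans), [0.0 0.0 0.0 1.0])

# Rotation about the z-axis by angle θ, used only to build example data.
rotz(θ) = [cos(θ) -sin(θ) 0.0; sin(θ) cos(θ) 0.0; 0.0 0.0 1.0]

a = SplitTransform(rotz(0.3), [1.0, 2.0, 3.0])
b = SplitTransform(rotz(-1.1), [0.5, -0.25, 4.0])

# A single 4×4 product agrees with the structured composition up to rounding.
max_diff = maximum(abs, homogeneous(a) * homogeneous(b) - homogeneous(compose(a, b)))
```

The structured version does fewer floating-point operations (a 3×3 product, a 3×3 matrix-vector product and a vector add), while the homogeneous version is a single dense 4×4 product, which is the shape that maps well onto wide SIMD registers.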
I'm a little confused as to what's causing the performance discrepancy between the following two pieces of code:
BenchmarkTools shows (with bounds checks turned off and with -O3) that * for T3DSMatrix is about twice as fast on my machine as for T3DRotMatrix.
Here's the code_llvm for T3DRotMatrix:
and for T3DSMatrix:
so the difference is
This is despite the fact that the code_llvm for RotMatrix * RotMatrix is identical to that for SMatrix * SMatrix, and likewise for RotMatrix * SVector and SMatrix * SVector.
Could you help me understand what's going on? Could it be related to this TODO?
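As an aside on the kind of comparison described above, the LLVM IR for two functions can be captured as strings and scanned for call instructions instead of being compared by eye. The sketch below is only an illustration: Wrapper, mul_plain, mul_wrapped, llvm_ir and count_calls are hypothetical names, not the T3DRotMatrix and T3DSMatrix code from this issue, and counting lines that contain call is a rough heuristic (intrinsics also show up as calls).

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical stand-in for a value hidden behind a wrapper type.
struct Wrapper
    x::Float64
end

# Same arithmetic, without and with constructing the wrapper on the way.
mul_plain(a::Float64, b::Float64) = a * b
mul_wrapped(a::Float64, b::Float64) = Wrapper(a * b).x

# Capture the IR as a string rather than printing it to stdout.
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types))

# Rough heuristic: count IR lines that contain a `call` instruction.
# A constructor that was not inlined would appear as one of these.
count_calls(ir) = count(line -> occursin(r"\bcall\b", line), split(ir, '\n'))

ir_plain = llvm_ir(mul_plain, (Float64, Float64))
ir_wrapped = llvm_ir(mul_wrapped, (Float64, Float64))

println("call lines: plain = ", count_calls(ir_plain),
        ", wrapped = ", count_calls(ir_wrapped))
```

If the two counts differ, the extra call lines in the IR are the place to start looking, for example for a symbol such as the @julia_Type_... function mentioned in the reply.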