Help understanding performance discrepancy

tkoolen commented 7 years ago

I'm a little confused as to what's causing the performance discrepancy between the following two pieces of code:

immutable T3DRotMatrix{T}
    rot::RotMatrix{3, T, 9}
    trans::SVector{3, T}
end

@inline *(t1::T3DRotMatrix, t2::T3DRotMatrix) = T3DRotMatrix(t1.rot * t2.rot, t1.trans + t1.rot * t2.trans)

immutable T3DSMatrix{T}
    rot::SMatrix{3, 3, T, 9}
    trans::SVector{3, T}
end

@inline *(t1::T3DSMatrix, t2::T3DSMatrix) = T3DSMatrix(t1.rot * t2.rot, t1.trans + t1.rot * t2.trans)

BenchmarkTools shows (with bounds checks turned off and with -O3) that * for T3DRotMatrix is about twice as fast on my machine as for T3DRotMatrix.

Here's the code_llvm for T3DRotMatrix:

define void @julia_Type_71814(%T3DRotMatrix* noalias sret, %jl_value_t*, %RotMatrix*, %SVector*) #0 {
top:
  %4 = alloca %T3DRotMatrix, align 8
  %5 = alloca [9 x double], align 8
  call void @julia_Type_71534([9 x double]* noalias nonnull sret %5, %jl_value_t* inttoptr (i64 4445652016 to %jl_value_t*), %RotMatrix* %2) #0
  %6 = bitcast [9 x double]* %5 to i8*
  %7 = bitcast %T3DRotMatrix* %4 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %7, i8* %6, i64 72, i32 8, i1 false)
  %8 = bitcast %T3DRotMatrix* %4 to i8*
  %9 = bitcast %SVector* %3 to i8*
  %10 = getelementptr inbounds %T3DRotMatrix, %T3DRotMatrix* %4, i64 0, i32 1
  %11 = bitcast %SVector* %10 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %11, i8* %9, i64 24, i32 1, i1 false)
  %12 = bitcast %T3DRotMatrix* %0 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %12, i8* %8, i64 96, i32 8, i1 false)
  ret void
}

and for T3DSMatrix:

define void @julia_Type_71869(%T3DSMatrix* noalias sret, %jl_value_t*, %SMatrix*, %SVector*) #0 {
top:
  %4 = alloca %T3DSMatrix, align 8
  %5 = bitcast %SMatrix* %2 to i8*
  %6 = bitcast %T3DSMatrix* %4 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %6, i8* %5, i64 72, i32 1, i1 false)
  %7 = bitcast %T3DSMatrix* %4 to i8*
  %8 = bitcast %SVector* %3 to i8*
  %9 = getelementptr inbounds %T3DSMatrix, %T3DSMatrix* %4, i64 0, i32 1
  %10 = bitcast %SVector* %9 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %10, i8* %8, i64 24, i32 1, i1 false)
  %11 = bitcast %T3DSMatrix* %0 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %11, i8* %7, i64 96, i32 8, i1 false)
  ret void
}

so the difference is

  %5 = alloca [9 x double], align 8
  call void @julia_Type_71534([9 x double]* noalias nonnull sret %5, %jl_value_t* inttoptr (i64 4445652016 to %jl_value_t*), %RotMatrix* %2) #0

This is despite the fact that the code_llvm for RotMatrix * RotMatrix is identical to that for SMatrix * SMatrix, and likewise for RotMatrix * SVector and SMatrix * SVector.

Could you help me understand what's going on? Could it be related to this TODO?

andyferris commented 7 years ago

Very interesting!

Are those code snippets for the constructors? I wonder what @julia_Type_71534 is? Can you track that down? If it is the constructor RotMatrix, then we might actually need to create explicitly inlined inner constructors... but I'm not sure if that actually works or not, or indeed if that is the problem. The other constructor is for SMatrix.

Clearly it is doing a function call - in theory we should be able to do this all in the one stack frame.

Does changing this definition here help?

immutable RotMatrix{N,T,L} <: Rotation{N,T} # which is <: AbstractMatrix{T}
    mat::SMatrix{N, N, T, L} # The final parameter to SMatrix is the "length" of the matrix, 3 × 3 = 9

    @inline RotMatrix(x::SMatrix{N,N,T,L}) = new(x)
end

PS - if you're interested, you might like to consider CoordinateTransformations for an alternative route to affine transformations (I didn't go through this level of benchmarking, however).

tkoolen commented 7 years ago

Yes, those LLVM code snippets were for the constructors, sorry for adding confusion with the * methods.

Adding the @inline constructor results in a stack overflow because it calls itself :-).

I just ran my benchmarks and @code_llvms again, and now there is no difference between the two! I'm not sure what happened. Maybe a Pkg.update() changed things? I'm a little confused, but I'll close the issue in any case.

I was aware of CoordinateTransformations. I created my RigidBodyDynamics.Transform3D type before CoordinateTransformations existed, and when I switched from Quaternions.jl to Rotations.jl it was easier to just change the rot field of RigidBodyDynamics.Transform3D from Quaternions.Quaternion to Rotations.RotMatrix than it was to switch to CoordinateTransformations.Transformation. It's also one less dependency, and I don't currently need the additional functionality in CoordinateTransformations. Finally, I'm thinking about switching the underlying data of RigidBodyDynamics.Transform3D to a single SMatrix{4, 4} representing both rotation and translation as a homogeneous matrix, because it turns out that on modern processors with AVX instructions, multiplying SMatrix{4, 4}s is actually significantly faster than 'exploiting the sparsity' by having separate rotation and translation fields (5.628 ns vs. 12.073 ns on my machine). The difference between these two options is negligible on machines without AVX instructions btw.

JuliaGeometry / Rotations.jl

Help understanding performance discrepancy #22