RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

Improve `MultibodyPlant<AutoDiffXd>` performance: kinematics #16877

Open · RussTedrake opened this issue 2 years ago

RussTedrake commented 2 years ago

I've established a benchmark in multibody/inverse_kinematics/position_constraint_benchmark that demonstrates the difference between using AutoDiffXd vs using CalcJacobian for kinematic gradients. I will use this issue to track some experiments / performance improvements.
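For readers who haven't looked at the benchmark, the comparison is roughly this: either push `AutoDiffXd` through the whole kinematics evaluation, or evaluate on the `double` plant and ask for the analytical Jacobian. A minimal sketch of the two paths, assuming a plant, its `AutoDiffXd` counterpart, contexts, and a frame of interest are already available (the function names below are illustrative, not the benchmark's actual code):

```c++
#include <Eigen/Dense>

#include "drake/common/autodiff.h"
#include "drake/math/autodiff.h"
#include "drake/math/autodiff_gradient.h"
#include "drake/multibody/plant/multibody_plant.h"

using drake::AutoDiffXd;
using drake::multibody::Frame;
using drake::multibody::JacobianWrtVariable;
using drake::multibody::MultibodyPlant;
using drake::systems::Context;

// Path (a): push AutoDiffXd through the whole kinematics computation.
// Returns d(p_WQ)/dq, where Q is a point at frame_B's origin.
Eigen::MatrixXd PointGradientViaAutoDiff(
    const MultibodyPlant<AutoDiffXd>& plant, Context<AutoDiffXd>* context,
    const Frame<AutoDiffXd>& frame_B, const Eigen::VectorXd& q) {
  // Seed dq/dq = I so downstream values carry gradients w.r.t. q.
  plant.SetPositions(context, drake::math::InitializeAutoDiff(q));
  drake::MatrixX<AutoDiffXd> p_WQ(3, 1);
  plant.CalcPointsPositions(*context, frame_B,
                            drake::Vector3<AutoDiffXd>::Zero(),
                            plant.world_frame(), &p_WQ);
  return drake::math::ExtractGradient(p_WQ.col(0));
}

// Path (b): compute the same 3 x nq gradient analytically on the
// double-valued plant via CalcJacobianTranslationalVelocity.
Eigen::MatrixXd PointGradientViaJacobian(
    const MultibodyPlant<double>& plant, Context<double>* context,
    const Frame<double>& frame_B, const Eigen::VectorXd& q) {
  plant.SetPositions(context, q);
  Eigen::MatrixXd Jq(3, plant.num_positions());
  plant.CalcJacobianTranslationalVelocity(
      *context, JacobianWrtVariable::kQDot, frame_B,
      Eigen::Vector3d::Zero() /* point at frame_B's origin */,
      plant.world_frame(), plant.world_frame(), &Jq);
  return Jq;
}
```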

RussTedrake commented 2 years ago

I had originally hoped that I could do something as simple as shortcutting the CalcPositionKinematicsCache<AutoDiffXd> method to call CalcJacobian... But the APIs are not similar enough to make that work out.

In some initial profiling experiments, RigidTransform<AutoDiffXd>::operator* repeatedly showed up as a hot spot. Reading the code, my first observation is that the AutoDiff pipeline is currently unable to take advantage of the significant sparsity in the gradients of the RigidTransforms used in the MBP kinematics. Many transforms depend on only one joint; more generally, they depend only on their parent joints.

I did a quick experiment here, in which I changed only RotationMatrix<AutoDiffXd> to use a custom matrix type that implements these sparser gradients. I hacked and slashed to make things compile (I'm sure we can do much better!) and cut corners on performance for the less critical methods.
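To make the sparsity idea concrete, here is a minimal standalone sketch (not the code from the experiment above) of a rotation-matrix-like type that stores the value plus per-variable partials only for the variables it actually depends on; a product then touches only the union of the two operands' variables, instead of dense nq-long derivative vectors in every scalar:

```c++
#include <map>

#include <Eigen/Dense>

// Sketch only: a 3x3 value plus a sparse set of partial derivatives,
// keyed by variable index.  Absent keys are implicitly zero.
struct SparseGradMatrix3 {
  Eigen::Matrix3d value{Eigen::Matrix3d::Identity()};
  std::map<int, Eigen::Matrix3d> partials;  // partials[i] = d(value)/dq_i

  // Product rule: d(AB)/dq_i = dA/dq_i * B + A * dB/dq_i, touching only
  // the variables either operand actually depends on.
  SparseGradMatrix3 operator*(const SparseGradMatrix3& other) const {
    SparseGradMatrix3 result;
    result.value = value * other.value;
    for (const auto& [i, dA] : partials) {
      result.partials[i] = dA * other.value;
    }
    for (const auto& [i, dB] : other.partials) {
      auto it = result.partials.find(i);
      if (it == result.partials.end()) {
        result.partials[i] = value * dB;
      } else {
        it->second += value * dB;
      }
    }
    return result;
  }
};
```

For a chain of joint transforms, a body's rotation then carries partials only for the joints on its path to the world, rather than a dense derivative vector per matrix entry.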

Before this change (on my Mac laptop), I see:

----------------------------------------------------------------------------------------------------------------
Benchmark                                                                      Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------
IiwaPositionConstraintFixture/EvalAutoDiffMbpDouble/iterations:1000         4456 ns         4456 ns         1000
IiwaPositionConstraintFixture/EvalAutoDiffMbpAutoDiff/iterations:1000     213834 ns       213802 ns         1000

After the change, I already see:

----------------------------------------------------------------------------------------------------------------
Benchmark                                                                      Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------
IiwaPositionConstraintFixture/EvalAutoDiffMbpDouble/iterations:1000         5804 ns         5804 ns         1000
IiwaPositionConstraintFixture/EvalAutoDiffMbpAutoDiff/iterations:1000      93115 ns        93106 ns         1000

I think we'll see bigger wins if we lift this up to the level of RigidTransform, which wouldn't be much more work.
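Lifting this to RigidTransform would mean composing the rotation and the translation under the same sparse product rule. Again just a sketch, extending the hypothetical SparseGradMatrix3 struct above under the same assumptions:

```c++
// Sketch: a rigid transform X = (R, p) with sparse partials.
// Composition X1 * X2 = (R1 * R2, p1 + R1 * p2), so
//   d(R)/dq_i follows the rotation product rule above, and
//   d(p)/dq_i = dp1/dq_i + dR1/dq_i * p2 + R1 * dp2/dq_i.
struct SparseGradTransform {
  SparseGradMatrix3 R;  // rotation with sparse partials (see sketch above)
  Eigen::Vector3d p{Eigen::Vector3d::Zero()};
  std::map<int, Eigen::Vector3d> dp;  // sparse partials of p

  SparseGradTransform operator*(const SparseGradTransform& other) const {
    SparseGradTransform result;
    result.R = R * other.R;
    result.p = p + R.value * other.p;
    // Accumulate into result.dp, zero-initializing absent entries.
    auto add_to = [&result](int i, const Eigen::Vector3d& v) {
      auto it = result.dp.try_emplace(i, Eigen::Vector3d::Zero()).first;
      it->second += v;
    };
    for (const auto& [i, dpi] : dp) add_to(i, dpi);
    for (const auto& [i, dRi] : R.partials) add_to(i, dRi * other.p);
    for (const auto& [i, dpi2] : other.dp) add_to(i, R.value * dpi2);
    return result;
  }
};
```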

sherm1 commented 2 years ago

Wow!

RussTedrake commented 2 years ago

FWIW -- I ran the same experiment on cassie_bench.

Before:

------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
CassieDoubleFixture/DoubleMassMatrix                    4678 ns         4678 ns       147933 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieDoubleFixture/DoubleInverseDynamics               5702 ns         5702 ns       121557 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieDoubleFixture/DoubleForwardDynamics              16550 ns        16549 ns        41888 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffMassMatrix             1965820 ns      1965785 ns          339 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffInverseDynamics        2651010 ns      2647926 ns          282 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffForwardDynamics        4475147 ns      4457648 ns          162 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieExpressionFixture/ExpressionMassMatrix          908470 ns       908307 ns          714
CassieExpressionFixture/ExpressionInverseDynamics     967414 ns       967331 ns          723
CassieExpressionFixture/ExpressionForwardDynamics    1654806 ns      1654664 ns          423

After:

------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
CassieDoubleFixture/DoubleMassMatrix                    4723 ns         4705 ns       149174 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieDoubleFixture/DoubleInverseDynamics               5776 ns         5776 ns       122465 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieDoubleFixture/DoubleForwardDynamics              16652 ns        16652 ns        42245 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffMassMatrix             1432117 ns      1432055 ns          488 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffInverseDynamics        1568468 ns      1568420 ns          460 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffForwardDynamics        3250954 ns      3250843 ns          223 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieExpressionFixture/ExpressionMassMatrix          853561 ns       853544 ns          802
CassieExpressionFixture/ExpressionInverseDynamics     942393 ns       942392 ns          733
CassieExpressionFixture/ExpressionForwardDynamics    1664506 ns      1653561 ns          426

cc @rpoyner-tri

sherm1 commented 2 years ago

Per f2f at onsite -- this looks promising. PTAL @rpoyner-tri when you have a chance.

RussTedrake commented 2 years ago

FWIW - I took a quick look at lifting this to a template specialization of Eigen's PlainObjectBase, but that was not an enjoyable exercise. The storage in that class is protected, not private, and it is quite hard to change because the derived classes depend on the details of its implementation.

rpoyner-tri commented 2 years ago

Just coming back around to this. The numbers look promising. ~The link to dev code above is a bit of a land mine (points to "make a new PR" page). I'll see about fixing that bit.~ Link above fixed.

xuchenhan-tri commented 2 years ago

@RussTedrake are you still working on this or has the adoption team taken over?

jwnimmer-tri commented 2 years ago

Adoption hasn't taken over, but @aykut-tri (with @hongkai-dai guidance) might start working on this IIRC.

jwnimmer-tri commented 2 years ago

Oops, my mistake. @aykut-tri is not working on this issue; I confused it with #16635, which is separate work.

xuchenhan-tri commented 2 years ago

Is there someone who's going to carry this forward? Would it be @RussTedrake, Dynamics, or Adoption? This seems like a very nice improvement, and it would be a shame if it just rotted away.

sherm1 commented 2 years ago

Thanks for the reminder, Xuchen. I'll take this if no one else has it. I'm in dire need of an interesting coding project.

sherm1 commented 2 years ago

@RussTedrake I built this on Ubuntu. I could run position_constraint_benchmark, but I see position_constraint_test failures with large gradient discrepancies. Were the tests passing in your build?

sherm1 commented 2 years ago

I pushed a fixup of Russ's branch in #17236 (the transpose was setting gradients unconditionally to zero). Once that passes all tests, I'll rerun the benchmarks for reference. Assuming the speedups are still there, I'll attempt a production-ready version that maintains or improves the performance.
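For illustration only (this is not the actual diff in #17236): in terms of the sparse-gradient sketch earlier in the thread, a correct transpose has to transpose every stored partial rather than discarding them, e.g.

```c++
// Continuing the hypothetical SparseGradMatrix3 sketch from above: the
// transposed value must be paired with the transpose of every partial;
// dropping (zeroing) the partials silently corrupts downstream gradients.
SparseGradMatrix3 Transpose(const SparseGradMatrix3& M) {
  SparseGradMatrix3 result;
  result.value = M.value.transpose();
  for (const auto& [i, dM] : M.partials) {
    result.partials[i] = dM.transpose();  // d(M^T)/dq_i = (dM/dq_i)^T
  }
  return result;
}
```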

sherm1 commented 2 years ago

With the fix, I'm still seeing roughly a 2X speedup (about 50% less runtime) on position_constraint_benchmark and about 20% on cassie_bench. IMO still worth pursuing.