Open RussTedrake opened 2 years ago
I had originally hoped that I could do something as simple as shortcutting the CalcPositionKinematicsCache<AutoDiffXd> method to call CalcJacobian..., but the APIs are not similar enough to make that work.
In some initial profiling experiments, RigidTransform<AutoDiffXd>::operator* repeatedly showed up as a hot spot. In reading the code, my first observation is that the AutoDiff pipeline is currently unable to take advantage of the significant sparsity in the gradients of the RigidTransforms used in the MBP kinematics. Many transforms depend on only one joint; more generally, they depend only on parent joints.
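To make the sparsity point concrete, here is a minimal sketch of a scalar that stores only its nonzero partials, applied to composing two planar joint rotations. Everything in it (`SparseDual`, `PlanarRotation`, `JointRotation`, `Compose`) is a hypothetical name for illustration, not a Drake or Eigen type and not the code from the experiment. A dense derivative vector on a 7-joint arm would carry 7 entries per scalar, while the composed rotation below only ever touches the 2 joints it depends on.

```cpp
#include <cassert>
#include <cmath>
#include <map>

// Hypothetical scalar that stores only its nonzero partial derivatives,
// keyed by joint index. Not a Drake or Eigen type.
struct SparseDual {
  double value = 0.0;
  std::map<int, double> partials;  // joint index -> d(value)/d(q[index])
};

// Product rule, visiting only the stored partials of each operand.
SparseDual operator*(const SparseDual& a, const SparseDual& b) {
  SparseDual r;
  r.value = a.value * b.value;
  for (const auto& kv : a.partials) r.partials[kv.first] += kv.second * b.value;
  for (const auto& kv : b.partials) r.partials[kv.first] += a.value * kv.second;
  return r;
}

SparseDual operator+(const SparseDual& a, const SparseDual& b) {
  SparseDual r = a;
  r.value += b.value;
  for (const auto& kv : b.partials) r.partials[kv.first] += kv.second;
  return r;
}

SparseDual operator-(const SparseDual& a, const SparseDual& b) {
  SparseDual r = a;
  r.value -= b.value;
  for (const auto& kv : b.partials) r.partials[kv.first] -= kv.second;
  return r;
}

// Planar rotation [[c, -s], [s, c]], stored as its cosine and sine.
struct PlanarRotation {
  SparseDual c;
  SparseDual s;
};

// Rotation by joint angle q; its gradient is nonzero only for joint_index.
PlanarRotation JointRotation(int joint_index, double q) {
  PlanarRotation R;
  R.c.value = std::cos(q);
  R.c.partials[joint_index] = -std::sin(q);
  R.s.value = std::sin(q);
  R.s.partials[joint_index] = std::cos(q);
  return R;
}

// Composition A * B; the result depends only on the joints of A and B.
PlanarRotation Compose(const PlanarRotation& A, const PlanarRotation& B) {
  PlanarRotation R;
  R.c = A.c * B.c - A.s * B.s;
  R.s = A.s * B.c + A.c * B.s;
  return R;
}
```

For example, `Compose(JointRotation(0, 0.3), JointRotation(1, 0.5))` yields entries with exactly two stored partials, regardless of how many joints the full model has. A production version would more likely use a contiguous index range than a `std::map`; the map is used here only to keep the sketch short.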
I did a quick experiment here, in which I changed only RotationMatrix<AutoDiffXd> to use a custom matrix type that implemented these sparser gradients. I hacked and slashed to make things compile (I'm sure we can do much better!), and cut corners in performance for less critical methods.
Before this change (on my mac laptop), I see:
----------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time          CPU Iterations
----------------------------------------------------------------------------------------------------------------
IiwaPositionConstraintFixture/EvalAutoDiffMbpDouble/iterations:1000        4456 ns      4456 ns       1000
IiwaPositionConstraintFixture/EvalAutoDiffMbpAutoDiff/iterations:1000    213834 ns    213802 ns       1000
After the change, I already see:
----------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time          CPU Iterations
----------------------------------------------------------------------------------------------------------------
IiwaPositionConstraintFixture/EvalAutoDiffMbpDouble/iterations:1000        5804 ns      5804 ns       1000
IiwaPositionConstraintFixture/EvalAutoDiffMbpAutoDiff/iterations:1000     93115 ns     93106 ns       1000
I think we'll see bigger wins if we lift this up to the level of RigidTransform, which wouldn't be much more work.
Wow!
fwiw -- I ran the same experiment on the cassie bench.
Before:
------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time          CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
CassieDoubleFixture/DoubleMassMatrix                     4678 ns      4678 ns     147933 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieDoubleFixture/DoubleInverseDynamics                5702 ns      5702 ns     121557 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieDoubleFixture/DoubleForwardDynamics               16550 ns     16549 ns      41888 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffMassMatrix              1965820 ns   1965785 ns        339 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffInverseDynamics         2651010 ns   2647926 ns        282 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffForwardDynamics         4475147 ns   4457648 ns        162 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieExpressionFixture/ExpressionMassMatrix           908470 ns    908307 ns        714
CassieExpressionFixture/ExpressionInverseDynamics      967414 ns    967331 ns        723
CassieExpressionFixture/ExpressionForwardDynamics     1654806 ns   1654664 ns        423
After:
------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time          CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
CassieDoubleFixture/DoubleMassMatrix                     4723 ns      4705 ns     149174 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieDoubleFixture/DoubleInverseDynamics                5776 ns      5776 ns     122465 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieDoubleFixture/DoubleForwardDynamics               16652 ns     16652 ns      42245 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffMassMatrix              1432117 ns   1432055 ns        488 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffInverseDynamics         1568468 ns   1568420 ns        460 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieAutodiffFixture/AutodiffForwardDynamics         3250954 ns   3250843 ns        223 Allocs.max=0 Allocs.mean=0 Allocs.min=0 Allocs.stddev=0
CassieExpressionFixture/ExpressionMassMatrix           853561 ns    853544 ns        802
CassieExpressionFixture/ExpressionInverseDynamics      942393 ns    942392 ns        733
CassieExpressionFixture/ExpressionForwardDynamics     1664506 ns   1653561 ns        426
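For readers comparing the tables, the ratios can be worked out with two small helpers. This is only arithmetic on the wall-clock "Time" column as posted; `SpeedupFactor` and `TimeSavedFraction` are hypothetical names, not part of any benchmark tooling.

```cpp
#include <cassert>

// How many times faster the "after" run is than the "before" run.
double SpeedupFactor(double before_ns, double after_ns) {
  return before_ns / after_ns;
}

// Fraction of the original runtime that was removed.
double TimeSavedFraction(double before_ns, double after_ns) {
  return 1.0 - after_ns / before_ns;
}
```

Applied to the numbers above, the iiwa AutoDiff case (213834 ns to 93115 ns) comes out at roughly 2.3x, and the Cassie AutoDiff cases fall between about 1.37x (mass matrix, forward dynamics) and about 1.69x (inverse dynamics). The double and Expression rows are essentially unchanged, as expected, since only the AutoDiff path was modified.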
cc @rpoyner-tri
Per f2f at onsite -- this looks promising. PTAL @rpoyner-tri when you have a chance.
FWIW - I took a quick look at lifting this to a template specialization of Eigen's PlainObjectBase, but that was not an enjoyable exercise. The storage in that class is protected, not private, and it is quite hard to change because the derived classes depend on the details of its implementation.
Just coming back around to this. The numbers look promising. ~The link to dev code above is a bit of a land mine (points to "make a new PR" page). I'll see about fixing that bit.~ Link above fixed.
@RussTedrake are you still working on this or has the adoption team taken over?
Adoption hasn't taken over, but @aykut-tri (with @hongkai-dai's guidance) might start working on this IIRC.
Oops, my mistake. @aykut-tri is not working on this issue; I confused it with #16635, which is separate work.
Is there someone who's going to carry this forward? Would it be @RussTedrake, Dynamics, or Adoption? It seems like a very nice improvement, and it would be a shame if it were left to rot.
Thanks for the reminder, Xuchen. I'll take this if no one else has it. I'm in dire need of an interesting coding project.
@RussTedrake I built this on Ubuntu. I could run position_constraint_benchmark, but position_constraint_test fails with large gradient discrepancies. Were tests passing in your build?
I pushed a fixup of Russ's branch in #17236 (transpose was setting gradients unconditionally to zero). Once that passes all tests I'll rerun benchmarks for reference. Assuming the speedups are still there, I'll attempt a production-ready version that maintains or improves the performance.
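As an illustration of the kind of bug described (not the actual code from the branch or from #17236), here is a minimal sketch of a matrix that carries per-entry gradients. `GradMat2` and `Transpose` are hypothetical names. The point is that a transpose has to move each entry's gradient along with its value; resetting the gradients would still give correct values, so a benchmark would run, but any derivative computed downstream would be wrong.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical 2x2 matrix with a gradient attached to every entry.
// grad[i][j][k] holds d(value[i][j]) / d(q[k]). Not a Drake type.
struct GradMat2 {
  std::array<std::array<double, 2>, 2> value{};
  std::array<std::array<std::vector<double>, 2>, 2> grad;
};

// Transpose that permutes the gradients together with the values.
// Leaving t.grad empty here would silently zero every derivative.
GradMat2 Transpose(const GradMat2& m) {
  GradMat2 t;
  for (int i = 0; i < 2; ++i) {
    for (int j = 0; j < 2; ++j) {
      t.value[i][j] = m.value[j][i];
      t.grad[i][j] = m.grad[j][i];
    }
  }
  return t;
}
```

A cheap regression check for this class of bug is that transposing twice returns the original gradients, and that a single transpose leaves a nonzero gradient nonzero.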
With the fix, I'm still seeing about a 2X speedup (roughly 50% less time) on position_constraint_benchmark and about 20% on cassie_bench. IMO still worth pursuing.
I've established a benchmark in multibody/inverse_kinematics/position_constraint_benchmark that demonstrates the difference between using AutoDiffXd vs using CalcJacobian for kinematic gradients. I will use this issue to track some experiments / performance improvements.
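For context on what the two approaches compute, here is a toy sketch on a planar two-link arm, assuming forward-mode dual numbers as a stand-in for AutoDiffXd and a hand-derived Jacobian as a stand-in for CalcJacobian. None of these names (`Dual2`, `TipPositionAutoDiff`, `TipJacobianAnalytic`) come from Drake, and this is not the benchmark code. Both routes give the same derivatives; the benchmark measures how much more the AutoDiff route costs.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical forward-mode scalar with derivatives w.r.t. (q0, q1).
struct Dual2 {
  double v;
  std::array<double, 2> d;
};

Dual2 operator+(const Dual2& a, const Dual2& b) {
  return {a.v + b.v, {a.d[0] + b.d[0], a.d[1] + b.d[1]}};
}
Dual2 operator*(double k, const Dual2& a) {
  return {k * a.v, {k * a.d[0], k * a.d[1]}};
}
Dual2 Cos(const Dual2& a) {
  const double s = std::sin(a.v);
  return {std::cos(a.v), {-s * a.d[0], -s * a.d[1]}};
}
Dual2 Sin(const Dual2& a) {
  const double c = std::cos(a.v);
  return {std::sin(a.v), {c * a.d[0], c * a.d[1]}};
}

// Tip position (x, y) of a planar two-link arm, with derivatives
// propagated automatically through every operation.
std::array<Dual2, 2> TipPositionAutoDiff(double q0, double q1,
                                         double l0, double l1) {
  const Dual2 a0{q0, {1.0, 0.0}};
  const Dual2 a1{q1, {0.0, 1.0}};
  const Dual2 sum = a0 + a1;
  std::array<Dual2, 2> tip;
  tip[0] = l0 * Cos(a0) + l1 * Cos(sum);
  tip[1] = l0 * Sin(a0) + l1 * Sin(sum);
  return tip;
}

// The same derivatives written out by hand: J[i][k] = d(tip[i]) / d(q[k]).
std::array<std::array<double, 2>, 2> TipJacobianAnalytic(double q0, double q1,
                                                         double l0, double l1) {
  std::array<std::array<double, 2>, 2> J;
  J[0][0] = -l0 * std::sin(q0) - l1 * std::sin(q0 + q1);
  J[0][1] = -l1 * std::sin(q0 + q1);
  J[1][0] = l0 * std::cos(q0) + l1 * std::cos(q0 + q1);
  J[1][1] = l1 * std::cos(q0 + q1);
  return J;
}
```

Note that the analytic version computes only the four numbers that are needed, while the dual-number version carries a full derivative vector through every intermediate scalar, which is where the overhead discussed in this issue comes from.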