Open RussTedrake opened 2 years ago
Possibly drake/math/linear_solve.h
(or #15874) is also relevant.
\CC @hongkai-dai FYI
Yes. I'm planning to use #15874 in my short term fix.
Seems like @amcastro-tri is quite involved on this issue. Do you already have some Eigen-only repro? If not, do you have plans to work on this more?
Otherwise, I can take a look this week.
Russ, it looks like the mass matrix in SolveWithGuess() is 10x10, but I think the cart/pole is only 2 dofs. Did you use a different system to produce the final output above?
@sherm1 -- not sure where you are getting 10x10? The cart pole was the system under test for all of these. There are 5 derivatives (4 states and 1 control input).
@xuchenhan-tri -- would love if you want to take it!
If @amcastro-tri doesn't already have some work in progress or doesn't have the bandwidth to follow through with this, I can investigate more later this week and report back.
I definitely won't have time to work on this issue this week @xuchenhan-tri. If you think you can make progress this week, please feel free to re-assign. If anything, I can help with reviews.
Should we rename or close and open another issue? #17038, already solves the problem in the title for this issue.
@sherm1 -- not sure where you are getting 10x10? The cart pole was the system under test for all of these. There are 5 derivatives (4 states and 1 control input).
Sorry, misread the output! When you say "5 derivatives" do you mean partials of p_star[0] and p_star[1] with respect to 5 variables? So dp=[∂p₀/∂q₀ ∂p₀/∂q₁ ∂p₀/∂v₀ ... ∂p₁/∂u]
(in some order)?
I was able to reproduce @RussTedrake's failure in Mac x86.
I tried reproducing this within Eigen 3.4 but failed. The way I tried to reproduce is by capturing the value of M and v and their derivatives and set up the same solve with Eigen's AutoDiffScalar.
Then I tried to reproduce the same thing in AutoDiffXd as a Drake unit test here. Still can't reproduce in Eigen3.4 + Mac x86.
You can reproduce this on Ubuntu, like so:
I took the code currently in #17038, merged it up to the tip of master, reverted the tamsi implementation fix but kept the new unit test case, switched to Eigen 3.4, and tested with Clang 12 on Ubuntu 20 (env CC=clang-12 CXX=clang++12 bazel test //multibody/plant:multibody_plant_forward_dynamics_test
).
The test failed:
[ RUN ] MultibodyPlantTest.CartPoleLinearization
multibody/plant/test/multibody_plant_forward_dynamics_test.cc:269: Failure
Value of: CompareMatrices(dt_linearization->A(), A_expected, 1e-12)
Actual: false (Values at (0, 2) exceed tolerance: 0.090000000000000011 vs. 0.10000000000000001, diff = 0.009999999999999995, tolerance = 9.9999999999999998e-13
m1 =
1 0.0098100000000000027 0.090000000000000011 0.00045454545454545471
0 1.2158200000000001 -0.22000000000000003 0.11000000000000001
0 0.098100000000000021 0.90000000000000002 0.004545454545454547
0 2.1582000000000003 -2.2000000000000002 1.1000000000000001
m2 =
1 0.0098100000000000027 0.10000000000000001 0
0 1.2158200000000001 0 0.10000000000000001
0 0.098100000000000021 1 0
0 2.1582000000000003 0 1
delta=
0 0 -0.009999999999999995 0.00045454545454545471
0 0 -0.22000000000000003 0.010000000000000009
0 0 -0.099999999999999978 0.004545454545454547
0 0 -2.2000000000000002 0.10000000000000009)
Expected: true
FYI same result with the default GCC (9.4), instead of Clang 12. That probably means it's a bug in the code, rather that the compiler.
Ok, I was able to reproduce in stand-alone Eigen example. Long story short,
Vector<AutoDiffScalar<VectorXd>, 2> x = A.ldlt().solve(b);
doesn't trigger the error, but
Vector<AutoDiffScalar<VectorXd>, Dynamic> x = A.ldlt().solve(b);
does.
I'll clean up and file an issue upstream.
Issue filed upstream at https://gitlab.com/libeigen/eigen/-/issues/2481.
With that, would you consider this issue as resolved? @RussTedrake
There is perhaps one more bit of follow up:
We should probably seek out and fix/destroy any others.
I think the victory condition for that is that we never call Eigen LDLT on AutoDiff from anywhere in Drake, but rather use the drake/math helper function as in #17038 instead.
From my quick skim, most calls to LDLT were on VectorXd
so are safe, but there might be a few still that use autodiff.
Maybe I'm missing something, but I see a large number of LDLT calls in multibody/
alone, and most of them look like they are templated on T
. So I would think we have a lot of possible landmines whenever using AutoDiffXd
?
I would definitely not close this issue until we resolve the ldlt implementation or replace all occurrences. (Even then it seems likely that someone could add a new ldlt and hit this?)
I see only the one in sap_model
but maybe I'm not looking hard enough. In any case, I agree we shouldn't close this until someone finishes an audit.
I guess I am assuming the Eigen::LDLT<...>
will have the same issues as Matrix.ldlt()
. It's true, we need to verify that.
We have a lot of math::LinearSolver<Eigen::LDLT, MatrixUpTo6<T>>
but those are fine, right? If you could link to a few of the ones that you're suspicious of, that might help.
Here's the only suspicious two I've noticed:
Thanks for the input.
Eigen::LDLT<...>
has the same issue, and so does Eigen::LLT<...>
. Eigen::LDLT<>
on matrices with AutoDiffXd?For (3), you could refer to common/symbolic_ldlt.h
for the syntax. I'm not sure it would be worth it, though -- it's plausible that Eigen upstream fixes this quickly (they are already looking into the ticket), in which case the only question is how quickly that fix ships into homebrew (and we might be able to agitate to accelerate that as well).
Thanks for the reference. It could be argued that one would never want to instantiate Eigen::LDLT with a autodiff matrix because it's almost certain that using the linear solver variant is more performant (when it comes to producing derivatives).
For now, I'll PR to clean up the existing landmines, monitor the upstream issue, and revisit this ticket in a few weeks.
Now that the Drake fix has landed (#17057), demoting this issue to low priority just so to remind us to keep track of the progress of the Eigen upstream fix.
Update: merge request filed upstream at https://gitlab.com/libeigen/eigen/-/merge_requests/976.
I updated the title of the issue to be more descriptive of the real issue.
The moral of the story is don't use Eigen::LLT
or Eigen::LDLT
directly with AutoDiffXd
as long as we still support Eigen 3.4.0. Instead, use the drake::math::LinearSolver
variant. From reading the discussion upstream, it looks like the bug happens in the triangular solve, so Eigen::LU
is probably affected too.
Since the bug fix is already merged upstream, it'll most likely make the next Eigen release? Once that happens (and we confirm it), we can close this issue.
My vote would be to just close this now. The next Eigen release is an indeterminate amount of time in the future, and I can't imagine how a merged commit that fixes the problem would not eventually make it into some release. I could imagine a 3.4.1 without it but not a 3.5 or etc.
... as long as we still support Eigen 3.4.0.
That will be for at minimum 4 more years, and a good chance it's somewhat longer. Eigen hasn't tagged a release in many years, and Ubuntu 24.04 will ship with 3.4.0 which means we need to support that until Ubuntu 28.04 is out.
I think that's reasonable. I'll give other people involved in this thread some time to weigh in and close this issue if no one objects.
I observed that I was getting surprisingly different LQR gains for a
MultibodyPlant
with time_step=0 vs time_step=small and spent some time investigating. I eventually winnowed it down to a minimal c++ example on the cart-pole system, for which the Linearize method was giving a bad linearization for the discrete-time variant: https://github.com/RussTedrake/drake/commit/3c424d000e63989690c993a9f3f10bb4aca2cedbThis was the output on my mac (with Eigen 3.4):
I would expect Ad ≈ I + time_stepAc, Bd ≈ time_stepBc. But here Bc[3] == 0, which is just wrong.
Running the exact same sha on ubuntu (with Eigen 3.3) gives the correct anser:
After some debugging, I narrowed it down to the
ldlt
call in tamsi_solver.ccSolveWithGuess
. I added some debugging codewhich yields
That is clearly wrong.
Furthermore, replacing the ldlt() line with
v = M.inverse()*p_star;
resolves the problem and gives me the right A and B matrices.It sure seems like this is a bug in Eigen 3.4 LDLT + AutoDiffXd. We should try to make a minimal Eigen-only repro and then report it. In the short term, I will try replacing this particular LDLT call. We should probably seek out and fix/destroy any others.
Slack conversation: https://drakedevelopers.slack.com/archives/C2WBPQDB7/p1650886581189249