RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

`Eigen::LLT` and `Eigen::LDLT` give bad results for AutoDiffXd in Eigen 3.4 #17037

Open RussTedrake opened 2 years ago

RussTedrake commented 2 years ago

I observed that I was getting surprisingly different LQR gains for a MultibodyPlant with `time_step=0` vs `time_step=small` and spent some time investigating. I eventually winnowed it down to a minimal C++ example on the cart-pole system, for which the Linearize method was giving a bad linearization for the discrete-time variant: https://github.com/RussTedrake/drake/commit/3c424d000e63989690c993a9f3f10bb4aca2cedb

This was the output on my mac (with Eigen 3.4):

drake % ./bazel-bin/examples/multibody/cart_pole/cart_pole_balancing --time_step=0  
A =      0      0      1      0
     0      0      0      1
     0  0.981      0      0
     0 21.582      0      0
B =   0
  0
0.1
0.2
drake % ./bazel-bin/examples/multibody/cart_pole/cart_pole_balancing --time_step=0.1
A =           1     0.00981        0.09 0.000454545
          0     1.21582       -0.22        0.11
          0      0.0981         0.9  0.00454545
          0      2.1582        -2.2         1.1
B = 0.000909091
          0
 0.00909091
          0

I would expect `Ad ≈ I + time_step * Ac`, `Bd ≈ time_step * Bc`. But here `Bc[3] == 0`, which is just wrong.

Running the exact same SHA on Ubuntu (with Eigen 3.3) gives the correct answer:

russt@home-cart13:~/drake$ bazel run examples/multibody/cart_pole:cart_pole_balancing -- --time_step=0.1
Starting local Bazel server and connecting to it...
INFO: Invocation ID: 47a7e854-bff0-4885-8b6a-0f94450ad56e
INFO: Analyzed target //examples/multibody/cart_pole:cart_pole_balancing (93 packages loaded, 3054 targets configured).
INFO: Found 1 target...
Target //examples/multibody/cart_pole:cart_pole_balancing up-to-date:
  bazel-bin/examples/multibody/cart_pole/cart_pole_balancing
INFO: Elapsed time: 95.679s, Critical Path: 66.98s
INFO: 407 processes: 147 disk cache hit, 31 internal, 229 linux-sandbox.
INFO: Build completed successfully, 407 total actions
INFO: Build completed successfully, 407 total actions
A =       1 0.00981     0.1       0
      0 1.21582       0     0.1
      0  0.0981       1       0
      0  2.1582       0       1
B = 0.001
0.002
 0.01
 0.02

After some debugging, I narrowed it down to the `ldlt()` call in `tamsi_solver.cc` `SolveWithGuess()`. I added some debugging code:

    // M vˢ⁺¹ = p*.
    v = M.ldlt().solve(p_star);
    if constexpr (std::is_same_v<T, AutoDiffXd>) {
      std::cout << "dp_star = "
                << math::ExtractGradient(p_star)
                << std::endl;
      std::cout << "d(M*v) = "
                << math::ExtractGradient(M * v)
                << std::endl;
    }

which yields

dp_star =      0      0     11   -0.5    0.1
     0 0.4905   -0.5   0.25      0
d(M*v) =           0           0          11        -0.5         0.1
          0      0.4905          -1    0.272727 -0.00454545

That is clearly wrong.

Furthermore, replacing the `ldlt()` line with `v = M.inverse() * p_star;` resolves the problem and gives me the right A and B matrices.

It sure seems like this is a bug in Eigen 3.4 LDLT + AutoDiffXd. We should try to make a minimal Eigen-only repro and then report it. In the short term, I will try replacing this particular LDLT call. We should probably seek out and fix/destroy any others.

Slack conversation: https://drakedevelopers.slack.com/archives/C2WBPQDB7/p1650886581189249

jwnimmer-tri commented 2 years ago

Possibly drake/math/linear_solve.h (or #15874) is also relevant.

\CC @hongkai-dai FYI

RussTedrake commented 2 years ago

Yes. I'm planning to use #15874 in my short term fix.

xuchenhan-tri commented 2 years ago

Seems like @amcastro-tri is quite involved in this issue. Do you already have an Eigen-only repro? If not, do you have plans to work on this more?

Otherwise, I can take a look this week.

sherm1 commented 2 years ago

Russ, it looks like the mass matrix in SolveWithGuess() is 10x10, but I think the cart/pole is only 2 dofs. Did you use a different system to produce the final output above?

RussTedrake commented 2 years ago

@sherm1 -- not sure where you are getting 10x10? The cart pole was the system under test for all of these. There are 5 derivatives (4 states and 1 control input).

RussTedrake commented 2 years ago

@xuchenhan-tri -- would love if you want to take it!

xuchenhan-tri commented 2 years ago

If @amcastro-tri doesn't already have some work in progress or doesn't have the bandwidth to follow through with this, I can investigate more later this week and report back.

amcastro-tri commented 2 years ago

I definitely won't have time to work on this issue this week @xuchenhan-tri. If you think you can make progress this week, please feel free to re-assign. If anything, I can help with reviews.

amcastro-tri commented 2 years ago

Should we rename this, or close it and open another issue? #17038 already solves the problem in the title of this issue.

sherm1 commented 2 years ago

> @sherm1 -- not sure where you are getting 10x10? The cart pole was the system under test for all of these. There are 5 derivatives (4 states and 1 control input).

Sorry, misread the output! When you say "5 derivatives" do you mean partials of p_star[0] and p_star[1] with respect to 5 variables? So dp=[∂p₀/∂q₀ ∂p₀/∂q₁ ∂p₀/∂v₀ ... ∂p₁/∂u] (in some order)?

xuchenhan-tri commented 2 years ago

I was able to reproduce @RussTedrake's failure on Mac x86.

I tried reproducing this with Eigen 3.4 alone but failed. The way I tried to reproduce was to capture the values of M and v and their derivatives and set up the same solve with Eigen's AutoDiffScalar.

Then I tried to reproduce the same thing with AutoDiffXd as a Drake unit test here. Still can't reproduce with Eigen 3.4 + Mac x86.

jwnimmer-tri commented 2 years ago

You can reproduce this on Ubuntu, like so:

I took the code currently in #17038, merged it up to the tip of master, reverted the tamsi implementation fix but kept the new unit test case, switched to Eigen 3.4, and tested with Clang 12 on Ubuntu 20 (`env CC=clang-12 CXX=clang++-12 bazel test //multibody/plant:multibody_plant_forward_dynamics_test`).

The test failed:

[ RUN      ] MultibodyPlantTest.CartPoleLinearization
multibody/plant/test/multibody_plant_forward_dynamics_test.cc:269: Failure
Value of: CompareMatrices(dt_linearization->A(), A_expected, 1e-12)
  Actual: false (Values at (0, 2) exceed tolerance: 0.090000000000000011 vs. 0.10000000000000001, diff = 0.009999999999999995, tolerance = 9.9999999999999998e-13
m1 =
                     1  0.0098100000000000027   0.090000000000000011 0.00045454545454545471
                     0     1.2158200000000001   -0.22000000000000003    0.11000000000000001
                     0   0.098100000000000021    0.90000000000000002   0.004545454545454547
                     0     2.1582000000000003    -2.2000000000000002     1.1000000000000001
m2 =
                    1 0.0098100000000000027   0.10000000000000001                     0
                    0    1.2158200000000001                     0   0.10000000000000001
                    0  0.098100000000000021                     1                     0
                    0    2.1582000000000003                     0                     1
delta=
                     0                      0  -0.009999999999999995 0.00045454545454545471
                     0                      0   -0.22000000000000003   0.010000000000000009
                     0                      0  -0.099999999999999978   0.004545454545454547
                     0                      0    -2.2000000000000002    0.10000000000000009)
Expected: true

jwnimmer-tri commented 2 years ago

FYI, same result with the default GCC (9.4) instead of Clang 12. That probably means it's a bug in the code, rather than the compiler.

xuchenhan-tri commented 2 years ago

Ok, I was able to reproduce this in a stand-alone Eigen example. Long story short,

`Vector<AutoDiffScalar<VectorXd>, 2> x = A.ldlt().solve(b);` doesn't trigger the error, but `Vector<AutoDiffScalar<VectorXd>, Dynamic> x = A.ldlt().solve(b);` does.

I'll clean up and file an issue upstream.

xuchenhan-tri commented 2 years ago

Issue filed upstream at https://gitlab.com/libeigen/eigen/-/issues/2481.

With that, would you consider this issue as resolved? @RussTedrake

jwnimmer-tri commented 2 years ago

There is perhaps one more bit of follow up:

> We should probably seek out and fix/destroy any others.

I think the victory condition for that is that we never call Eigen LDLT on AutoDiff from anywhere in Drake, but rather use the drake/math helper function as in #17038 instead.

From my quick skim, most calls to LDLT were on VectorXd so are safe, but there might be a few still that use autodiff.

RussTedrake commented 2 years ago

Maybe I'm missing something, but I see a large number of LDLT calls in multibody/ alone, and most of them look like they are templated on T. So I would think we have a lot of possible landmines whenever using AutoDiffXd?

I would definitely not close this issue until we resolve the ldlt implementation or replace all occurrences. (Even then it seems likely that someone could add a new ldlt and hit this?)

jwnimmer-tri commented 2 years ago

I see only the one in sap_model but maybe I'm not looking hard enough. In any case, I agree we shouldn't close this until someone finishes an audit.

RussTedrake commented 2 years ago

I guess I am assuming that `Eigen::LDLT<...>` will have the same issues as `Matrix::ldlt()`. It's true, we need to verify that.

jwnimmer-tri commented 2 years ago

We have a lot of `math::LinearSolver<Eigen::LDLT, MatrixUpTo6<T>>` but those are fine, right? If you could link to a few of the ones that you're suspicious of, that might help.

Here are the only two suspicious ones I've noticed:

https://github.com/RobotLocomotion/drake/blob/39d10b48169ebd697fcfff08761153da9c467486/multibody/contact_solvers/sap/sap_model.cc#L387

https://github.com/RobotLocomotion/drake/blob/39d10b48169ebd697fcfff08761153da9c467486/multibody/plant/tamsi_solver.h#L952

xuchenhan-tri commented 2 years ago

Thanks for the input.

  1. Yes, `Eigen::LDLT<...>` has the same issue, and so does `Eigen::LLT<...>`.
  2. I didn't find any usage of plain Eigen Cholesky other than the two @jwnimmer-tri found above.
  3. Replacing all current occurrences seems like the best we can do for now. As for avoiding this landmine in the future (until it is fixed upstream), is there a way we can disable the template instantiation of `Eigen::LDLT<>` on matrices of AutoDiffXd?

jwnimmer-tri commented 2 years ago

For (3), you could refer to common/symbolic_ldlt.h for the syntax. I'm not sure it would be worth it, though -- it's plausible that Eigen upstream fixes this quickly (they are already looking into the ticket), in which case the only question is how quickly that fix ships into homebrew (and we might be able to agitate to accelerate that as well).

xuchenhan-tri commented 2 years ago

Thanks for the reference. It could be argued that one would never want to instantiate `Eigen::LDLT` with an autodiff matrix, because it's almost certain that using the linear-solver variant is more performant (when it comes to producing derivatives).

For now, I'll PR to clean up the existing landmines, monitor the upstream issue, and revisit this ticket in a few weeks.

xuchenhan-tri commented 2 years ago

Now that the Drake fix has landed (#17057), I'm demoting this issue to low priority, just to remind us to keep track of the progress of the Eigen upstream fix.

xuchenhan-tri commented 2 years ago

Update: merge request filed upstream at https://gitlab.com/libeigen/eigen/-/merge_requests/976.

xuchenhan-tri commented 5 months ago

I updated the title of the issue to be more descriptive of the real issue.

The moral of the story is: don't use `Eigen::LLT` or `Eigen::LDLT` directly with AutoDiffXd as long as we still support Eigen 3.4.0. Instead, use the `drake::math::LinearSolver` variant. From reading the discussion upstream, it looks like the bug happens in the triangular solve, so `Eigen::LU` is probably affected too.

Since the bug fix is already merged upstream, it'll most likely make the next Eigen release? Once that happens (and we confirm it), we can close this issue.

jwnimmer-tri commented 5 months ago

My vote would be to just close this now. The next Eigen release is an indeterminate amount of time in the future, and I can't imagine how a merged commit that fixes the problem would not eventually make it into some release. I could imagine a 3.4.1 without it but not a 3.5 or etc.

> ... as long as we still support Eigen 3.4.0.

That will be for at minimum 4 more years, and there's a good chance it's somewhat longer. Eigen hasn't tagged a release in many years, and Ubuntu 24.04 will ship with 3.4.0, which means we need to support that until Ubuntu 28.04 is out.

xuchenhan-tri commented 5 months ago

I think that's reasonable. I'll give other people involved in this thread some time to weigh in and close this issue if no one objects.