RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu
Other
3.29k stars 1.26k forks source link

linear_system: Segfault when using `LinearSystem` as double integrator? #12706

Closed EricCousineau-TRI closed 4 years ago

EricCousineau-TRI commented 4 years ago

Migrating @TobiaMarcucci's Slack post to GitHub:

I’m having a strange issue with LinearSystem. Here is a minimal example: I wire a double integrator with an LQR, I simulate, and at simulator.AdvanceTo() my python kernel dies without any printed error. However, if I write a DoubleIntegrator inheriting directly from VectorSystem everything works. I tried in Colab too, but I got the same issue.


EDIT: C++ code repro, pure output computation, from 6fc1729c5

#include "drake/systems/primitives/linear_system.h"
#include "drake/systems/controllers/linear_quadratic_regulator.h"
#include "drake/systems/framework/diagram_builder.h"

namespace drake {
namespace systems {
namespace {

using Eigen::MatrixXd;
using drake::systems::controllers::LinearQuadraticRegulator;

int DoMain() {
  MatrixXd A(2, 2);
  A << 0, 1, 0, 0;
  MatrixXd B(2, 1);
  B << 0, 1;
  MatrixXd C = MatrixXd::Identity(2, 2);
  MatrixXd D = MatrixXd::Zero(2, 1);

  DiagramBuilder<double> builder;

  auto* plant = builder.AddSystem(
      std::make_unique<LinearSystem<double>>(A, B, C, D));
  MatrixXd Q = MatrixXd::Identity(2, 2);
  MatrixXd R = MatrixXd::Identity(1, 1);
  auto* controller = builder.AddSystem(
      LinearQuadraticRegulator(*plant, Q, R));

  builder.Connect(controller->get_output_port(), plant->get_input_port());
  builder.Connect(plant->get_output_port(), controller->get_input_port());
  auto diagram = builder.Build();

  auto context = diagram->CreateDefaultContext();
  MatrixXd x0(2, 1);
  x0 << 0, 0;
  context->SetContinuousState(x0);
  plant->get_output_port().Eval(plant->GetMyContextFromRoot(*context));

  std::cout << "Done\n";
  return 0;
}

}  // namespace
}  // namespace systems
}  // namespace drake

int main(int, char**) {
  return drake::systems::DoMain();
}

Python Code:

import numpy as np
from pydrake.all import VectorSystem, LinearSystem, LinearQuadraticRegulator, DiagramBuilder, Simulator

A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
C = np.eye(2)
D = np.zeros((2, 1))
plant = LinearSystem(A, B, C, D)

controller = LinearQuadraticRegulator(plant, np.eye(2), np.eye(1))

## UNCOMMENT HERE TO MAKE IT WORK
# class DoubleIntegrator(VectorSystem):
#     def __init__(self):
#         VectorSystem.__init__(self, 1, 2, direct_feedthrough=False)
#         self.DeclareContinuousState(2)
#     def DoCalcVectorTimeDerivatives(self, context, u, x, xdot):
#         xdot[0] = x[1]
#         xdot[1] = u
#     def DoCalcVectorOutput(self, context, u, x, y):
#         y[:] = x
# plant = DoubleIntegrator()

builder = DiagramBuilder()
plant = builder.AddSystem(plant)
controller = builder.AddSystem(controller)
builder.Connect(controller.get_output_port(0), plant.get_input_port(0))
builder.Connect(plant.get_output_port(0), controller.get_input_port(0))
diagram = builder.Build()

simulator = Simulator(diagram)
simulator.get_mutable_context().SetContinuousState([0., 0.])
simulator.AdvanceTo(1.)
EricCousineau-TRI commented 4 years ago

Trying to debug it here, but can't make sense of it in pure Python land: 5e93cae4e Will re-implement in C++.

Branch: https://github.com/EricCousineau-TRI/drake/tree/issue-12706-wip

EricCousineau-TRI commented 4 years ago

Same behavior in pure C++: a38242ae6

Did a quick gut-checking on sizes, everything seemed fine... Guessing it's a memory leak of sorts.

EDIT: Nothing turns up in dc8cd6080 when doing bazel run --copt '-DDRAKE_ENABLE_ASSERTS=1' //tmp:double_integrator_cc_test Trying -c dbg

EricCousineau-TRI commented 4 years ago

Doing coredumpctl gdb after the segfault points to potential aliasing here: https://github.com/RobotLocomotion/drake/blob/82e1b2c0aaaf544e0dfb880929eaf40e6ca75032/systems/primitives/affine_system.cc#L298

Most relevant bits of stack trace:

#0  0x0000555eeb7fcfa7 in Eigen::internal::check_for_aliasing<Eigen::Matrix<double, -1, 1, 0, -1, 1>, Eigen::CwiseNullaryOp<Eigen::internal::
scalar_constant_op<double>, Eigen::Matrix<double, -1, 1, 0, -1, 1> > > (                                                                     
    dst=<error reading variable: Cannot access memory at address 0x7fff4d035ff8>,                                                            
    src=<error reading variable: Cannot access memory at address 0x7fff4d035ff0>) at external/eigen/Eigen/src/Core/Transpose.h:392           
...
#20 0x0000555eeba3230d in Eigen::MatrixBase<Eigen::Block<Eigen::Matrix<double, -1, 1, 0, -1, 1>, -1, 1, false> >::operator=<Eigen::CwiseBinar
yOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Product<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Matrix<double, -1, 1, 0, 
-1, 1>, 0> const, Eigen::Matrix<double, -1, 1, 0, -1, 1> const> > (this=0x7fff4d0364d0, other=...)                                           
    at external/eigen/Eigen/src/Core/Assign.h:66                                                                                             
#21 0x0000555eeba2ded2 in drake::systems::AffineSystem<double>::CalcOutputY (this=0x555eec628570, context=..., output_vector=0x555eec63c370) 
    at systems/primitives/affine_system.cc:298                                                                                               
#22 0x0000555eeba3374a in drake::systems::OutputPort<double> const& drake::systems::LeafSystem<double>::DeclareVectorOutputPort<drake::system
s::TimeVaryingAffineSystem<double>, drake::systems::BasicVector<double> >(std::variant<std::__cxx11::basic_string<char, std::char_traits<char
>, std::allocator<char> >, drake::systems::UseDefaultName>, drake::systems::BasicVector<double> const&, void (drake::systems::TimeVaryingAffi
neSystem<double>::*)(drake::systems::Context<double> const&, drake::systems::BasicVector<double>*) const, std::set<drake::TypeSafeIndex<drake
::systems::DependencyTag>, std::less<drake::TypeSafeIndex<drake::systems::DependencyTag> >, std::allocator<drake::TypeSafeIndex<drake::system
s::DependencyTag> > >)::{lambda(drake::systems::Context<double> const&, drake::systems::BasicVector<double>*)#1}::operator()(drake::systems::
Context<double> const&, drake::systems::BasicVector<double>*) const (__closure=0x555eec6298f0, context=..., result=0x555eec63c370)           
    at bazel-out/k8-dbg/bin/systems/framework/_virtual_includes/leaf_system/drake/systems/framework/leaf_system.h:1775                       
...

My initial naive guess is that the LinearSystem being used as a double-integrator creates an undetected algebraic loop - I'm guess this uses the symbolic inspector for checking the direct feedthrough, it says there aren't dependencies, but then the actual linear algebra may have inter-dependencies? In the case of reimplementing with VectorSystem, it's explicitly sparse, and thus doesn't have aliasing issues?

@sherm1 Does this sound like a plausible scenario?

\cc @RussTedrake

EricCousineau-TRI commented 4 years ago

I've hit my time limit on this, but it seems like an problem specific to the systems framework. Reassigning to @sherm1, will letcha reassign as desired.

jwnimmer-tri commented 4 years ago

Yes, because MatrixXd D = MatrixXd::Zero(2, 1); the system is reporting that it is not symbolically direct feedthrough, but at runtime the output calculation is pulling the input (so that it can multiply it by zero).

EricCousineau-TRI commented 4 years ago

Gotcha, so disparity between the symbolic math and the actual implementation.

What would be the proper way Tobia should proceed in this regard? For simple double-integration, keep his workaround using the sparse math implementation (extending VectorSystem, rather than using LinearSystem)?

Is there anything to think about for the long-term? It seems like a difficult problem to detect... but on the other hand, it's a super basic workflow to have a user thrown for a loop? (and be so hard to initially debug, at least for me)

jwnimmer-tri commented 4 years ago

Yes, this comes up every once in a while. I thought we had an issue (or already implemented) to fail-fast if we hit a computational cycle during diagram evaluation, but I couldn't immediately find it.

We should repair LinearSystem (really, AffineSystem). It should be more careful when evaluating its input. If the D matrix is exactly zero, then it should not compute the D * u term during output. (Since D is a member field, we can pre-compute whether or not D is zero.)

sherm1 commented 4 years ago

Per discussion in #12253, IMO the punt-to-symbolic dependency analysis is more trouble than it's worth and we should not do it. Instead we should just believe the output port declared dependencies, which I'm sure in this case would have said "pass through", giving an annoying but helpful error message. AffineSystem should then be modified to declare its true dependencies (i.e., depends on only those u's corresponding to non-zero columns of D).

The under-the-cover transmogrify-to-symbolic most recently made life difficult for @ggould-tri in #12697. I hadn't realized until the present case that in addition to being complicated and causing compilation difficulty, the symbolic analysis can also be wrong in a non-conservative direction! That seals its fate, IMO. We could "repair" it by suppressing simplifications, but I don't think that's the right answer.

jwnimmer-tri commented 4 years ago

It sounds like we agree that AffineSystem's Calc implementation should be repaired.

We can leave #12253 to be discussed in its own issue. However, even if port feedthrough annotations are changed per that issue, a user could still make a mistake and introduce a computational loop that was not declared in the sparsity. So, we should still consider failing-fast when we encounter an evaluation loop.