RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu
Other
3.35k stars 1.27k forks source link

Segmentation Fault in KinematicTrajectoryOptimization when Using MinimumDistanceConstraint #20131

Closed cohnt closed 2 weeks ago

cohnt commented 1 year ago

What happened?

When constructing a MinimumDistanceConstraint and applying it at many points along a trajectory (for example, in KinematicTrajectoryOptimization), the way in which the constraint (and its corresponding context) are constructed can lead to errors or segmentation faults for unclear reason. I've done my best to make a standalone example, using code from manipulation. I've highlighted the relevant changes below -- it seems that re-creating the context or the constraint is causing the error.

Original code

    # collision constraints
    collision_constraint = MinimumDistanceConstraint(
        plant, 0.001, plant_context, None, 0.01
    )
    evaluate_at_s = np.linspace(0, 1, 25)
    for s in evaluate_at_s:
        trajopt.AddPathPositionConstraint(collision_constraint, s)

Modification to induce an error

    # collision constraints
    evaluate_at_s = np.linspace(0, 1, 25)
    for s in evaluate_at_s:
        context = diagram.CreateDefaultContext()
        plant_context = plant.GetMyContextFromRoot(context)
        collision_constraint = MinimumDistanceConstraint(
            plant, 0.001, plant_context, None, 0.01
        )
        trajopt.AddPathPositionConstraint(collision_constraint, s)

Modification to induce a segmentation fault

    # collision constraints
    evaluate_at_s = np.linspace(0, 1, 2)
    for s in evaluate_at_s:
        context = diagram.CreateDefaultContext()
        plant_context = plant.GetMyContextFromRoot(context)
        collision_constraint = MinimumDistanceConstraint(
            plant, 0.001, plant_context, None, 0.01
        )
        trajopt.AddPathPositionConstraint(collision_constraint, s)

Version

No response

What operating system are you using?

Unknown / Other, Ubuntu 22.04

What installation option are you using?

compiled from source code using CMake

Relevant log output

No response

hongkai-dai commented 1 year ago

Could you try one more thing here? To create the contexts outside of the for loop

    # collision constraints
    evaluate_at_s = np.linspace(0, 1, 2)
    contexts = [None] * 2
    plant_contexts = [None] * 2
    for i in len(evaluate_at_s):
        contexts[i] = diagram.CreateDefaultContext()
        plant_contexts[i] = plant.GetMyContextFromRoot(contexts[i])
        collision_constraint = MinimumDistanceConstraint(
            plant, 0.001, plant_contexts[i], None, 0.01
        )
        trajopt.AddPathPositionConstraint(collision_constraint, evaluate_at_s[i])

Do yo still see the segmentation fault?

In C++, if I create a context and plant_context within the loop, then the lifetime of that plant_context is just within the for loop. As soon as the for loop finishes, the context and plant_context are destroyed. I would think pybind does something similar? But it should be fine if the contexts are created and stored outside of the for loop.

cohnt commented 1 year ago

Your change works -- no segmentation fault.

jwnimmer-tri commented 1 year ago

Ideally, the MinimumDistanceConstraint argument plant_context should be annotated to "keep alive" whatever is passed in.

In fact it already seems to do that:

// Keep alive, reference: `self` keeps `plant_context` alive.
py::keep_alive<1, 4>()

Maybe GetMyContextFromRoot is missing an annotation? Nope, it seems to have it as well:

// Keep alive, ownership: `return` keeps `Context` alive.
py::keep_alive<0, 2>()

And AddPathPositionConstraint uses shared_ptr so should ref-count automatically correctly.

Someone will need to dig in and figure out where the reference counting goes awry. I don't immediately see any problem with the bindings.

hongkai-dai commented 1 year ago

I am going to delegate this issue to other pybind gurus. Anybody volunteer?

jwnimmer-tri commented 1 year ago

Per https://drake.mit.edu/issues.html we can assign to the component lead, +@EricCousineau-TRI.

EricCousineau-TRI commented 1 year ago

Ran into possibly related segfault in Anzu, testing things out.

I can reproduce the failure locally using your code (thanks!) https://github.com/RobotLocomotion/drake/tree/3115855343b2e717e008b3e5a3e01ecdc0334e80/tmp (this includes pybind11 upgrade, #20362, but that doesn't seem to change anything as far as this issue is concerned)

From adding trace (per debugging tips), I see the segfault happen as so:

$ bazel-bin/tmp/repro
...
repro.py(169):         traj = BsplineTrajectory(
repro.py(172):         values = traj.vector_values(np.linspace(0, 1, 50))
repro.py(173):         print(values)
...
repro.py(174):         meshcat.SetLine("positions_path", values)
Segmentation fault (core dumped)

With failure variant (25 control points), and naming the systems, I see:

$ bazel-bin/tmp/repro
...
repro.py(184):         meshcat.SetLine("positions_path", values)
RuntimeError: A function call on a drake::multibody::MultibodyPlant<double> system named '::diagram::plant' was passed the Context of a system named '::_' instead of the appropriate subsystem Context.
EricCousineau-TRI commented 1 year ago

Looks like the shared_ptr<> object is either getting sliced or lost; see #20380 (91a283c287)

EricCousineau-TRI commented 1 year ago

Confirmed, posted a new issue w/ writeup: #20383

jwnimmer-tri commented 2 weeks ago

Closing as duplicate of #20383.