RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu
Other
3.28k stars 1.26k forks source link

Segmentation Fault when Calling diagram.ForcedPublish on Context with Inf values #21219

Open cohnt opened 6 months ago

cohnt commented 6 months ago

What happened?

I encountered this bug, and was able to reduce it to this MWE.

import time
import numpy as np
from pydrake.all import (
    StartMeshcat,
    DiagramBuilder,
    AddMultibodyPlantSceneGraph,
    Parser,
    ProcessModelDirectives,
    MeshcatVisualizer,
    MeshcatVisualizerParams,
    Role
)

directives = """
# Model directive for iiwa with sphere collision and welded gripper
directives:

# Add iiwa
- add_model:
    name: iiwa
    file: package://drake/manipulation/models/iiwa_description/urdf/iiwa14_spheres_collision.urdf

- add_weld:
    parent: world
    child: iiwa::base

# Add schunk
- add_model:
    name: wsg
    file: package://drake/manipulation/models/wsg_50_description/sdf/schunk_wsg_50_welded_fingers.sdf

- add_frame:
    name: iiwa::wsg_attach
    X_PF:
      base_frame: iiwa::iiwa_link_7
      translation: [0, 0, 0.114]
      rotation: !Rpy { deg: [90.0, 0.0, 0.0 ]}

- add_weld:
    parent: iiwa::wsg_attach
    child: wsg::body
"""

def build_env(meshcat, directives):
    builder = DiagramBuilder()

    plant, scene_graph = AddMultibodyPlantSceneGraph(builder, time_step=0.0)

    parser = Parser(plant)
    parser.AddModelsFromString(directives, ".dmd.yaml")

    plant.Finalize()

    meshcat_visual_params = MeshcatVisualizerParams()
    meshcat_visual_params.delete_on_initialization_event = False
    meshcat_visual_params.role = Role.kIllustration
    meshcat_visual_params.prefix = "visual"
    meshcat_visual = MeshcatVisualizer.AddToBuilder(
        builder, scene_graph, meshcat, meshcat_visual_params)

    meshcat_collision_params = MeshcatVisualizerParams()
    meshcat_collision_params.delete_on_initialization_event = False
    meshcat_collision_params.role = Role.kProximity
    meshcat_collision_params.prefix = "collision"
    meshcat_collision_params.visible_by_default = False
    meshcat_collision = MeshcatVisualizer.AddToBuilder(
        builder, scene_graph, meshcat, meshcat_collision_params)

    meshcat_visual.Delete()
    meshcat_collision.Delete()

    diagram = builder.Build()

    return diagram, plant, scene_graph

meshcat = StartMeshcat()
diagram, plant, scene_graph = build_env(meshcat, directives)

diagram_context = diagram.CreateDefaultContext()
plant_context = plant.GetMyContextFromRoot(diagram_context)

model_instance = plant.GetModelInstanceByName("iiwa")
indices = plant.GetActuatedJointIndices(model_instance)
joint_names = []
for idx in indices:
    joint = plant.get_joint(idx)
    meshcat.AddSlider(joint.name(),
                      value=0,
                      min=joint.position_lower_limits()[0],
                      max=joint.position_upper_limits()[0],
                      step=0.01)
    joint_names.append(joint.name())

meshcat.AddButton("Stop")
while meshcat.GetButtonClicks("Stop") < 1:
    q = [meshcat.GetSliderValue(joint_name) for joint_name in joint_names]
    q[0] = np.inf
    plant.SetPositions(plant_context, q)
    diagram.ForcedPublish(diagram_context)
    time.sleep(0.05)

Version

No response

What operating system are you using?

No response

What installation option are you using?

No response

Relevant log output

No response

RussTedrake commented 6 months ago

Here's a slightly more minimal reproduction:

import numpy as np
from pydrake.all import (
    StartMeshcat,
    DiagramBuilder,
    AddMultibodyPlantSceneGraph,
    Parser,
    MeshcatVisualizer,
)

directives = """
# Model directive for iiwa with sphere collision and welded gripper
directives:

# Add iiwa
- add_model:
    name: iiwa
    file: package://drake/manipulation/models/iiwa_description/urdf/iiwa14_spheres_collision.urdf

- add_weld:
    parent: world
    child: iiwa::base

# Add schunk
- add_model:
    name: wsg
    file: package://drake/manipulation/models/wsg_50_description/sdf/schunk_wsg_50_welded_fingers.sdf

- add_frame:
    name: iiwa::wsg_attach
    X_PF:
      base_frame: iiwa::iiwa_link_7
      translation: [0, 0, 0.114]
      rotation: !Rpy { deg: [90.0, 0.0, 0.0 ]}

- add_weld:
    parent: iiwa::wsg_attach
    child: wsg::body
"""

def build_env(meshcat, directives):
    builder = DiagramBuilder()

    plant, scene_graph = AddMultibodyPlantSceneGraph(builder, time_step=0.0)

    parser = Parser(plant)
    parser.AddModelsFromString(directives, ".dmd.yaml")

    plant.Finalize()

    meshcat_visual = MeshcatVisualizer.AddToBuilder(
        builder, scene_graph, meshcat)

    meshcat_visual.Delete()

    diagram = builder.Build()

    return diagram, plant, scene_graph

meshcat = StartMeshcat()
diagram, plant, scene_graph = build_env(meshcat, directives)

diagram_context = diagram.CreateDefaultContext()
plant_context = plant.GetMyContextFromRoot(diagram_context)
q = plant.GetPositions(plant_context)
q[0] = np.inf

for i in range(3):
    plant.SetPositions(plant_context, q)
    diagram.ForcedPublish(diagram_context)
cohnt commented 6 months ago

Yes, that example is more minimal than mine. (Perhaps "minimal" working example is a bit of a misnomer!)

Note also that q[0] = np.nan also causes a segfault.

RussTedrake commented 6 months ago

I made a quick c++ reproduction (which can run in meshcat_visualizer_test.cc):

GTEST_TEST(MeshcatVisualizerTest, Issue21219) {
std::string directives = R"""(
directives:
- add_model:
    name: iiwa
    file: package://drake/manipulation/models/iiwa_description/urdf/iiwa14_spheres_collision.urdf
- add_weld:
    parent: world
    child: iiwa::base
- add_model:
    name: wsg
    file: package://drake/manipulation/models/wsg_50_description/sdf/schunk_wsg_50_welded_fingers.sdf
- add_frame:
    name: iiwa::wsg_attach
    X_PF:
      base_frame: iiwa::iiwa_link_7
      translation: [0, 0, 0.114]
      rotation: !Rpy { deg: [90.0, 0.0, 0.0 ]}
- add_weld:
    parent: iiwa::wsg_attach
    child: wsg::body
)""";

  auto meshcat = std::make_shared<Meshcat>();
    systems::DiagramBuilder<double> builder;
  auto [plant, scene_graph] = AddMultibodyPlantSceneGraph(&builder, 0.0);
    multibody::Parser parser(&plant);
    parser.AddModelsFromString(directives, ".dmd.yaml");
    plant.Finalize();

    MeshcatVisualizer<double>::AddToBuilder(&builder, scene_graph, meshcat);

    auto diagram = builder.Build();

  auto diagram_context = diagram->CreateDefaultContext();
  systems::Context<double>* plant_context = &plant.GetMyMutableContextFromRoot(diagram_context.get());
  Eigen::VectorXd q = plant.GetPositions(*plant_context);
  q[0] = std::numeric_limits<double>::infinity();

  for (int i=0; i<3; ++i) {
      plant.SetPositions(plant_context, q);
      diagram->ForcedPublish(*diagram_context);
  }
}

It throws in RotationMatrix on this line, which seems like it's the intended behavior. The stack trace is coming through the Multibody::CalcPositionKinematicsCache().

Assuming that we think it's reasonable for the kinematics code to throw when a joint angle is inf or nan (not sure what it could do better?), then the main question is: why do we segfault instead of throwing the exception?

In c++, when I run in debug mode, i see:

[ RUN      ] MeshcatVisualizerTest.Issue21219
[2024-03-29 17:27:29.970] [console] [info] Meshcat listening for connections at http://localhost:7001
unknown file: Failure
C++ exception with description "Error: Rotation matrix contains an element that is infinity or NaN." thrown in the test body.

but in release mode, i see

[ RUN      ] MeshcatVisualizerTest.Issue21219
[2024-03-29 17:27:42.135] [console] [info] Meshcat listening for connections at http://localhost:7001
Segmentation fault (core dumped)

I guess this is probably because of some compiler optimization?

Is that something we can/should fix? Maybe @jwnimmer-tri knows?

xuchenhan-tri commented 6 months ago

It could be that the throw is inside a DRAKE_ASSERT_VOID guard and is only triggered in debug and the segfault happens later on.

RussTedrake commented 6 months ago

Good call @xuchenhan-tri . That's exactly what's happening. I'll look for the next offender.

RussTedrake commented 6 months ago

Ok. The segfault is in fcl::DynamicAABBTreeCollisionManager, called from SceneGraph<T>::CalcPoseUpdate. Maybe we should actually throw in the kinematics call, even in release mode?

xuchenhan-tri commented 6 months ago

Definitely seems better than segfaulting, but we should probably be careful about where the check/throw happens given how low level rotation matrix is.

jwnimmer-tri commented 6 months ago

FYI #19263 is tangentially related. In that case, calculating using a HalfSpace ended up introducing inf values, which then also segfaulted.

jwnimmer-tri commented 6 months ago

Because RotationMatrix throws (in debug builds) for non-finite values, that means any q that can influence into a rotation must always be finite. Given that, I think it makes sense to say that all of the MultibodyPlant's q vector must always be finite. Therefore, we should throw (even in release builds) when it's not.

Ideally the MbP kinematics would do that, but in case we think those are evaluated too often to pay for the checks, we could imagine that the SG pose update (in prep for a collision check) could do the check.

RussTedrake commented 6 months ago

The DRAKE_ASSERT_VOID that made this a debug-only throw is here. Do I understand your comment as saying that we should not change this line to a release-build throw; e.g. directly in the RotationMatrix class? We could potentially move SetUnchecked() to the public API.

jwnimmer-tri commented 6 months ago

Do I understand your comment as saying that we should not change this line to a release-build throw; ...

Yes, that's right. IIRC the dynamics people are violently against promoting that one to a release-build check.

sherm1 commented 6 months ago

Right! That's a very expensive check.

RussTedrake commented 6 months ago

+@rpoyner-tri for disposition (and resolution?) since our normal component owner is ooo.

jwnimmer-tri commented 4 months ago

We could imagine that the systems framework could offer an option, when a system declares its discrete state, the declaration could mark it as "must be finite" (or "must not be nan" in some cases) and then the framework handles all of the validity checking on behalf of that system. I imagine we want many more systems than just MbP to fail-fast when operating on non-finite values.

Likewise declaring an input (or output) numeric port could declare finite (or non-nan), and the framework / diagram could enforce that. For instance, if an actuation input port was nan.