RobotLocomotion / robot-plan-runner


Understanding lcm message delays in drake systems. #29

Closed pangtao22 closed 3 years ago

pangtao22 commented 3 years ago

In our drake systems implementation of PlanManager, we have observed a delay on the order of milliseconds between receiving an IIWA_STATUS message and sending an IIWA_COMMAND message. We would like to better understand the reason for this delay @RussTedrake.

To isolate the delay introduced by the systems framework from our implementation of StateMachine, IiwaPlanManagerSystem is simplified to a passthrough, which only copies the utime field of the status message into the command message, and then sends it.

The time delay between an IIWA_STATUS message and the corresponding IIWA_COMMAND message is measured as follows:

  1. Start mock_station_simulation and run_plan_manager_system.
  2. Start lcm-logger.
  3. In the log, look for Events corresponding to IIWA_STATUS and IIWA_COMMAND messages, and record Event.timestamp.
  4. The time delay is the difference between the Event.timestamp values of a STATUS/COMMAND message pair in which the two messages have the same utime. The script for computing this delay can be found here.
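The pairing logic in step 4 can be sketched in Python. This is not the actual script linked above, just an illustration, and it assumes each log event has already been reduced to a (channel, Event.timestamp in microseconds, utime) tuple:

```python
# Illustrative sketch: match IIWA_STATUS and IIWA_COMMAND log events that
# share the same utime, and compute the delay between them in milliseconds.
# The (channel, timestamp_us, utime) tuple layout is an assumption, not the
# real lcm-logger event format.

def compute_delays(events):
    """Return per-utime delays (ms) from a STATUS event to the COMMAND
    event that echoes the same utime."""
    status_ts = {}   # utime -> Event.timestamp of the IIWA_STATUS message
    delays_ms = []
    for channel, timestamp_us, utime in events:
        if channel == "IIWA_STATUS":
            status_ts[utime] = timestamp_us
        elif channel == "IIWA_COMMAND" and utime in status_ts:
            delays_ms.append((timestamp_us - status_ts.pop(utime)) * 1e-3)
    return delays_ms

events = [
    ("IIWA_STATUS", 1_000_000, 0),
    ("IIWA_COMMAND", 1_004_000, 0),     # arrives 4 ms after its status
    ("IIWA_STATUS", 1_005_000, 5000),
    ("IIWA_COMMAND", 1_011_000, 5000),  # arrives 6 ms after its status
]
print(compute_delays(events))  # [4.0, 6.0]
```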

The average time delay of the systems framework varies from run to run, but is usually between 3 and 8 ms, as shown in the plots below.

[images: delay histograms for the systems framework]

In contrast, the average delay of the multi-threaded implementation of PlanManager is between 0.1 and 0.2 ms.

[image: delay histogram for the multi-threaded PlanManager]

All experiments are run on a Mac mini with Big Sur, 6-core 8th gen i7 and 64GB of RAM.

RussTedrake commented 3 years ago

This is great! I'd definitely like to dig into this and resolve it. I've got one more lecture to prep for Tuesday, but then can spend time working on this with you.

FWIW, I have had one issue for 3 years, but I think it's not getting you in this example? https://github.com/RobotLocomotion/drake/issues/9894

pangtao22 commented 3 years ago

I don't think so. As shown in the figure below, the diagram used to measure delays consists only of generic lcm subscriber/publisher and a pass-through (IiwaPlanManagerSystem).

[image: diagram consisting of an LCM status subscriber, the IiwaPlanManagerSystem pass-through, and an LCM command publisher]
pangtao22 commented 3 years ago

As discussed in today's meeting, I first tried setting Simulator's real-time rate to 0 here. As a result, the rate at which IIWA_COMMAND messages are published jumped from 200Hz to >60000Hz. This is clearly a testament to the systems framework's performance, but the numerous command messages flood the operating system's UDP buffer, which causes almost all status messages to be dropped, making it impossible to measure delay between command and status messages.
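The flooding effect can be illustrated with a toy model. This is not Drake or LCM code, and the buffer capacity and message counts are made-up illustrative numbers; the point is just that when one channel fills a bounded receive buffer between polls, later arrivals on other channels are dropped:

```python
# Toy model (not Drake or LCM code): messages arriving faster than they are
# drained overflow a fixed-capacity receive buffer, and later arrivals are
# dropped. Capacity and rates below are made-up illustrative numbers.

def poll_cycle(arrivals, capacity):
    """Fill a bounded buffer in arrival order; drop messages once full."""
    buffer = []
    dropped = 0
    for msg in arrivals:
        if len(buffer) < capacity:
            buffer.append(msg)
        else:
            dropped += 1
    return buffer, dropped

# With IIWA_COMMAND published at >60000Hz, hundreds of command messages can
# arrive within one 5 ms status period and fill the buffer before the single
# IIWA_STATUS message does.
arrivals = ["IIWA_COMMAND"] * 300 + ["IIWA_STATUS"]
buffer, dropped = poll_cycle(arrivals, capacity=300)
print(buffer.count("IIWA_STATUS"), dropped)  # 0 1
```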

It seems that the delay was introduced by using LcmInterfaceSystem to schedule publishing events. Following the example suggested by @RussTedrake , i.e. this issue and one of its example implementations, we may need to call diagram.Publish() manually in a loop upon receiving an IIWA_STATUS message. However, running this example, i.e. //examples/kuka_iiwa_arm/iiwa_controller, publishes IIWA_COMMAND messages at 400Hz instead of 200Hz.

In contrast, //examples/kuka_iiwa_arm/kuka_plan_runner, which doesn't wrap command computation in a Diagram, publishes IIWA_COMMAND messages correctly at 200Hz.

RussTedrake commented 3 years ago

If you already have a system in your diagram that will publish on AdvanceTo, then you presumably don't need to call diagram.Publish. If you have both, that would explain why you end up with 400Hz instead of 200Hz?

pangtao22 commented 3 years ago

The more straightforward way to implement an LCM-driven loop is to build a diagram like this: [image: diagram including an LcmPublisherSystem for IIWA_COMMAND], and call Simulator::AdvanceTo() in the loop. Here's the (pseudo)code for this implementation.

// owned_lcm_ is a unique pointer to a DrakeLcm instance.
// diagram is a pointer to the diagram in the figure above.
// plan_manager_ is a pointer to the `IiwaPlanManagerSystem` in the figure above.

// Create Simulator and extract contexts.
drake::systems::Simulator<double> sim(*diagram);
auto& context = sim.get_mutable_context();
auto& context_manager = plan_manager_->GetMyMutableContextFromRoot(&context);

// Wait for the first IIWA_STATUS message and record its time.
// There also needs to be a subscriber for ROBOT_PLAN, but it's not shown here for brevity.
auto status_sub = drake::lcm::Subscriber<drake::lcmt_iiwa_status>(
    owned_lcm_.get(), "IIWA_STATUS");
drake::lcm::LcmHandleSubscriptionsUntil(
    owned_lcm_.get(), [&]() { return status_sub.count() > 0; });
const double t_start = status_sub.message().utime * 1e-6;

while (true) {
  status_sub.clear();

  // Wait for the next IIWA_STATUS message.
  drake::lcm::LcmHandleSubscriptionsUntil(
      owned_lcm_.get(), [&]() { return status_sub.count() > 0; });

  // Update the diagram context with the latest status message.
  plan_manager_->get_iiwa_status_input_port().FixValue(
      &context_manager, status_sub.message());

  // Advance the simulator to the time stamp of the status message.
  const double t = status_sub.message().utime * 1e-6 - t_start;
  sim.AdvanceTo(t);
}

However, scheduling command-publishing events with Simulator leads to a large number of duplicate command messages for the same control tick (left blue peak in the figure below), or to missed control ticks (right blue peak).

The figure below plots the time difference between adjacent command and status messages. There is one status message every 5ms, whose arrival triggers a command message. So we expect a single peak around 5ms for the time difference between adjacent command messages. Having one peak at 0ms and another at 10ms indicates that there are duplicate and missed control ticks.

[image: histogram of time intervals between adjacent command messages, with peaks at 0 ms, 5 ms, and 10 ms]
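The diagnostic behind the figure can be sketched as follows. The timestamps are made-up illustrative values, not measured data; the point is how a duplicated tick and a missed tick show up in the gaps between adjacent command messages:

```python
# Sketch of the diagnostic: with one STATUS message every 5 ms, adjacent
# COMMAND messages should be ~5 ms apart. A duplicated control tick produces
# a ~0 ms gap, and the missed tick that follows produces a ~10 ms gap.
# Timestamps below are made-up illustrative values in milliseconds.

def adjacent_gaps_ms(timestamps_ms):
    """Differences between consecutive message timestamps."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]

healthy = [0, 5, 10, 15, 20]  # one command message per 5 ms tick
glitchy = [0, 5, 5, 15, 20]   # tick at 5 ms published twice, tick at 10 ms missed
print(adjacent_gaps_ms(healthy))  # [5, 5, 5, 5]
print(adjacent_gaps_ms(glitchy))  # [5, 0, 10, 5]
```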

pangtao22 commented 3 years ago

To improve performance while still making use of Simulator's event handling, it's better to remove the LcmPublisherSystem from the diagram and publish command messages manually in the LCM-driven loop, while letting Simulator handle non-time-critical events such as logging.

Specifically, here's the new diagram without command publishing:

[image: diagram without the IIWA_COMMAND LcmPublisherSystem]

And here's the (pseudo)code of the lcm-driven loop:

// owned_lcm_ is a unique pointer to a DrakeLcm instance.
// diagram is a pointer to the diagram in the figure above.
// plan_manager_ is a pointer to the `IiwaPlanManagerSystem` in the figure above.

// Create Simulator and extract contexts.
drake::systems::Simulator<double> sim(*diagram);
auto& context = sim.get_mutable_context();
auto& context_manager = plan_manager_->GetMyMutableContextFromRoot(&context);

// Wait for the first IIWA_STATUS message and record its time.
// There also needs to be a subscriber for ROBOT_PLAN, but it's not shown here for brevity.
auto status_sub = drake::lcm::Subscriber<drake::lcmt_iiwa_status>(
    owned_lcm_.get(), "IIWA_STATUS");
drake::lcm::LcmHandleSubscriptionsUntil(
    owned_lcm_.get(), [&]() { return status_sub.count() > 0; });
const double t_start = status_sub.message().utime * 1e-6;

while (true) {
  status_sub.clear();

  // Wait for the next IIWA_STATUS message.
  drake::lcm::LcmHandleSubscriptionsUntil(
      owned_lcm_.get(), [&]() { return status_sub.count() > 0; });

  // Update the diagram context with the latest status message.
  plan_manager_->get_iiwa_status_input_port().FixValue(
      &context_manager, status_sub.message());

  // Advance the simulator to the time stamp of the status message.
  const double t = status_sub.message().utime * 1e-6 - t_start;
  sim.AdvanceTo(t);

  // Compute the command message and publish it manually.
  const auto& iiwa_cmd_msg = plan_manager_->get_iiwa_command_output_port()
                                 .Eval<drake::lcmt_iiwa_command>(context_manager);
  drake::lcm::Publish(owned_lcm_.get(), "IIWA_COMMAND", iiwa_cmd_msg);
}

Now there's only a single peak around 5 ms in the time intervals between command messages: [image: interval histogram with a single peak]

Moreover, the delay between command and status messages is on par with our implementation that doesn't use Systems: [image: delay histogram]

pangtao22 commented 3 years ago

To conclude, it seems that the performance issue raised at the beginning of this issue is caused not by computing commands through evaluating the OutputPort of a Diagram, but by Simulator's event scheduling. The solution is to handle time-critical events such as command publishing manually in an LCM-driven loop, and let Simulator handle other events by calling AdvanceTo in the loop.

RussTedrake commented 3 years ago

Glad it's resolved. (I suspect that the event system is performant enough, too, but probably harder to use correctly).

hjsuh94 commented 3 years ago

Been reading the thread a bit late (but with excitement!), but FYI @pangtao22 the images seem broken for me. Will try to infer what they are for now.