Closed pangtao22 closed 3 years ago
This is great! I'd definitely like to dig into this and resolve it. I've got one more lecture to prep for Tuesday, but then can spend time working on this with you.
FWIW, I have had one issue for 3 years, but I think it's not getting you in this example? https://github.com/RobotLocomotion/drake/issues/9894
I don't think so. As shown in the figure below, the diagram used to measure delays consists only of generic lcm subscriber/publisher and a pass-through (IiwaPlanManagerSystem
).
As discussed in today's meeting, I first tried setting Simulator
's real-time rate to 0 here. As a result, the rate at which IIWA_COMMAND
messages are published jumped from 200Hz to >60000Hz. This is clearly a testament to the systems framework's performance, but the numerous command messages flood the operating system's UDP buffer, which causes almost all status messages to be dropped, making it impossible to measure delay between command and status messages.
It seems that the delay was introduced by using LcmInterfaceSystem
to schedule publishing events. Following the example suggested by @RussTedrake , i.e. this issue and one of its example implementation, we may need to call diagram.Publish()
manually in a loop upon receiving a IIWA_STATUS
message. However, running this example, i.e. //examples/kuka_iiwa_arm/iiwa_controller
, publishes IIWA_COMMAND
messages at 400Hz instead of 200.
In contrast, //examples/kuka_iiwa_arm/kuka_plan_runner
, which doesn't wrap command computation in a Diagram
, polishes IIWA_STATUS
messages correctly at 200Hz.
If you already have a system in your diagram that will publish on advanceto, then you presumably don't need to call diagram.Publish. If you have both, that would explain why you end up with 400Hz instead of 200Hz?
The more straightforward way to implement lcm-driven loop is to have a diagram
like this:
,
and call Simulator.AdvanceTo()
in the loop. Here's the (pseudo)code for this implementation.
// owned_lcm_ is a unique pointer to a DrakeLcm instance.
// diagram is a pointer to the diagram in the figure above.
// plan_manager is a pointer to the `IiwaPlanManagerSystem` in the figure above.
// Create Simulator and extract contexts.
drake::systems::Simulator<double> sim(*diagram);
auto& context = sim.get_mutable_context();
auto& context_manager = plan_manager_->GetMyMutableContextFromRoot(&context_diagram);
// wait for first IIWA_STATUS message and record its time.
// There also needs to be a subscriber for ROBOT_PLAN, but it's not shown here for brevity.
status_sub = drake::lcm::Subscriber<drake::lcmt_iiwa_status>>(owned_lcm_.get(), "IIWA_STATUS");
drake::lcm::LcmHandleSubscriptionsUntil(
owned_lcm_.get(), [&]() { return status_sub_->count() > 0; });
const double t_start = status_sub_->message().utime * 1e-6;
while (true) {
status_sub->clear();
// Wait for an IIWA_STATUS message.
drake::lcm::LcmHandleSubscriptionsUntil(
owned_lcm_.get(), [&]() { return status_sub->count() > 0; });
// Update diagram context.
plan_manager_->get_iiwa_status_input_port().FixValue(
&context_manager, status_sub->message());
// Call advance to.
const double t = status_sub_->message().utime * 1e-6 - t_start;
sim.AdvanceTo(t);
}
However, scheduling command publishing events with Simulator
leads to large amounts of duplicate command messages for the same control tick (left blue peak in the figure below), or missed control ticks (right blue peak).
The figure below plots the time difference between adjacent command and status messages. There is one status message every 5ms, whose arrival triggers a command message. So we expect a single peak around 5ms for the time difference between adjacent command messages. Having one peak at 0ms and another at 10ms indicates that there are duplicate and missed control ticks.
To improve performance and make use of Simulator
's event handling, it's better to remove the LcmPublisherSystem
in the diagram and publish messages manually in the lcm-driven loop, while having Simulator
handle non-time-critical events such as logging.
Specifically, here's the new diagram without command publishing:
And here's the (pseudo)code of the lcm-driven loop:
// owned_lcm_ is a unique pointer to a DrakeLcm instance.
// diagram is a pointer to the diagram in the figure above.
// plan_manager is a pointer to the `IiwaPlanManagerSystem` in the figure above.
// Create Simulator and extract contexts.
drake::systems::Simulator<double> sim(*diagram);
auto& context = sim.get_mutable_context();
auto& context_manager = plan_manager_->GetMyMutableContextFromRoot(&context_diagram);
// wait for first IIWA_STATUS message and record its time.
// There also needs to be a subscriber for ROBOT_PLAN, but it's not shown here for brevity.
status_sub = drake::lcm::Subscriber<drake::lcmt_iiwa_status>>(owned_lcm_.get(), "IIWA_STATUS");
drake::lcm::LcmHandleSubscriptionsUntil(
owned_lcm_.get(), [&]() { return status_sub_->count() > 0; });
const double t_start = status_sub_->message().utime * 1e-6;
while (true) {
status_sub->clear();
// Wait for an IIWA_STATUS message.
drake::lcm::LcmHandleSubscriptionsUntil(
owned_lcm_.get(), [&]() { return status_sub->count() > 0; });
// Update diagram context.
plan_manager_->get_iiwa_status_input_port().FixValue(
&context_manager, status_sub->message());
// Call advance to.
const double t = status_sub_->message().utime * 1e-6 - t_start;
sim.AdvanceTo(t);
// Compute command message and publish.
const auto& iiwa_cmd_msg = plan_manager_->get_iiwa_command_output_port()
.Eval<drake::lcmt_iiwa_command>(context_manager);
drake::lcm::Publish(owned_lcm_.get(), "IIWA_COMMAND", iiwa_cmd_msg);
}
Now there's only a single peak around 5ms in the time intervals between command messages:
Moreover, the delay between command and status messages is on par with our implementation that doesn't use Systems
:
To conclude, it seems that the performance issue raised at the beginning of this issue is not caused by computing commands by evaluating the OutputPort
of a Diagram
, but by Simulator
's event scheduling. The solution, is to handle time-critical events such as command publishing manually in an LCM-driven loop, and let Simulator
handle other events by calling AdvanceTo
in the loop.
Glad it's resolved. (I suspect that the event system is performant enough, too, but probably harder to use correctly).
Been reading the thread a bit late (but with excitement!), but FYI @pangtao22 the images seem broken for me. Will try to infer what they are for now.
In our drake systems implementation of
PlanManager
, we have observed a delay on the order of milliseconds between receiving a IIWA_STATUS message and sending a IIWA_COMMAND message. We would like to better understand the reason for this delay @RussTedrake.To isolate the delay introduced by the systems framework from our implementation of
StateMachine
,IiwaPlanManagerSystem
is simplified to a passthough, which only copies the utime field of the status message into the command message, and then sends it.Time delay between an IIWA_STATUS message and IIWA_COMMAND message is measured as follows:
mock_station_simulation
andrun_plan_manager_system
.lcm-logger
.Event
s corresponding to IIWA_STATUS and IIWA_COMMAND messages, and recordEvent.timestamp
.Event.timestamp
of an STATUS and COMMAND message pair, in which the two messages have the sameutime
. The script for computing this delay can be found here.The average time delay of the systems framework varies from run to run, but is usually between 3 and 8 ms, as shown in the plots below.
In contrast, the average delay of the multi-threaded implementation of
PlanManager
is between 0.1 and 0.2ms.All experiments are run on a Mac mini with Big Sur, 6-core 8th gen i7 and 64GB of RAM.