Drake's simulate() Method Needs to Support Externally Specified Termination Condition

liangfok commented 8 years ago

Currently, the termination condition is when the simulation time t exceeds the final time tf. The relevant source code is in drake/systems/Simulation.h#L118 and is replicated below:

while (t < tf) {
  //...
}

This is not sufficiently flexible when integrating Drake within a larger framework like ROS since ROS is multithreaded and installs its own SIGINT handler. When running Drake with ROS, typing ctrl+c will no longer shutdown the simulation. We need to design a mechanism for the above while loop to effectively implement the following logic:

while (t < tf && ros::ok()) {
  // ...
}

The ability to achieve the above logic will return the normal ability to terminate the simulation when ctrl+c is typed.

Without this fix, the only way to shut down Drake + ROS is to type ctr+\, which is potentially dangerous since it does not shut things down cleanly (it may result in zombie processes that you will then need to pick off one-by-one using something like kill -9).

liangfok commented 8 years ago

@david-german-tri: This need is something to keep in mind when Designing Framework 2.0.

sherm1 commented 8 years ago

The ultimate hybrid continuous/discrete solver will support events, and the handlers for events can signal termination. This won't require much from System 2.0 but will be mostly Solver 2.0.

liangfok commented 8 years ago

@sherm1: Just wondering how event support would help in this context. Will a separate thread need to be spawned that periodically checks the return value of ros::ok(), and emit a signal to Solver 2.0 when the return value is false?

Or, will Solver 2.0 embrace ROS and use a custom SIGINT handler that plays nice with ROS as described here?

sherm1 commented 8 years ago

I don't think Solver 2.0 will have any particular relationship with ROS. Its job will be to accurately advance continuous and discrete state variables through time in accordance with a set of equations. The continuous part requires numerical integration with accuracy control and localization of events, at which point discrete variables are updated by event handlers (including simple discrete update equations). This is a mathematical procedure, not a software stack. During a discrete update, termination can be requested. That could come from anywhere, including ROS, but that would be invisible to the solver that I'm thinking about. I think you have a bigger picture in mind than I do; at the moment I'm thinking about how we formulate a hybrid dynamical system and solve it efficiently.

Sigh -- would likely be a short conversation in person with a whiteboard!

liangfok commented 8 years ago

OK, this was actually the first time I've heard of the term "Solver 2.0" and I wasn't sure what it's relationship was with System 2.0. It sounds like Solver 2.0 is a simulator that supports both continuous and discrete state transitions, and that it will eventually be implemented using the software abstractions provided by System 2.0. Are these correct guesses?

Is "Solver 1.0" precisely the while loop I mentioned in the original description of this issue?

I'm happy to chat over Google Hangout or Skype anytime.

sherm1 commented 8 years ago

this was actually the first time I've heard of the term "Solver 2.0"

That's cuz I just made it up during this discussion! I like the name though. And yeah, Solver 1.0 would be the current while loop with a hacked-together fixed step RK2 integrator. But the original Drake solver is the Simulink one which is a very nice model for Solver 2.0.

Please feel free to call me via hangout any time you want to chat -- no appointment needed! If I can't talk I'll call you back.

jwnimmer-tri commented 8 years ago

I'm not really sure how much the Solver really comes into play here, either 1.0 or 2.0. Really, we just want to have a simulate() method where the "health check" part of its main loop is customizable.

Right now, body of the current while loop prior to const double dt = ... is doing a handle_realtime_factor check with some related options handling. That should all just be ExtractMethod'd into a bool-valued keep_going functor, which then @liangfok could replace with his functor that calls ros::ok. All in all, a relatively straightforward refactoring.

It would also be a good opportunity to better collect the handle_realtime-related code into a more cohesive class with code in a C file (not the header).

liangfok commented 8 years ago

@jwnimmer-tri: Great idea. I will do the refactor in the morning. Thanks!

liangfok commented 8 years ago

I think you have a bigger picture in mind than I do; at the moment I'm thinking about how we formulate a hybrid dynamical system and solve it efficiently.

This is certainly possible. I am eager to scale up Drake to simulate massive systems like a city in which millions of autonomous vehicles, manually-controlled vehicles, pedestrians, bicyclists, personal robots, delivery robots, etc, peacefully coexist with each other. I suspect achieving high fidelity simulation of just a single autonomous vehicle will stretch the limits of a single PC's computational resources, and am thus hoping that the architecture we come up with can be easily distributed to make use of more resources. It would be great if we could design System 2.0 / Solver 2.0 / Drake 2.0 with scalability in mind. Maybe a piece of the puzzle is to embrace a component-based architecture like that provided by ROS 2.0.

I think the biggest challenge will be to achieve a distributed simulation while maintaining Drake's "glass box" (all state variables in a single vector) nature.

liangfok commented 8 years ago

@sherm1 I need to resolve this issue as a high priority item for https://github.com/RobotLocomotion/drake/issues/2606. Maybe I could offload this from you? What's the current status on this?

Maybe I can address this issue in the short term by incrementally refactoring the existing code and then you follow up with an event-based longer term solution in Solver 2.0?

sherm1 commented 8 years ago

OK, @liangfok. What do you think of this approach: add an optional bool std::function to SimulateOptions and then change the loop to while (t < tf && !options.should_stop()) {}? The default value for should_stop would be something like [](){return false;} but you could set it to [](){return !ros::ok();}.

Some design questions: what (if any) arguments should be passed to the function? Does there have to be some way to record the reason that the simulation was stopped?

jwnimmer-tri commented 8 years ago

Putting it in SimulationOptions makes it more difficult to change later, compared to a separate function argument to simulate().

sherm1 commented 8 years ago

Putting it in SimulationOptions makes it more difficult to change later, compared to a separate function argument to simulate().

I don't see why -- please explain. Putting it in Options has the advantage of leaving the API unchanged, and Solver 2 will handle this differently.

liangfok commented 8 years ago

Thanks @sherm1. +1 for adding a bool std::function into SimulationOptions that enables customizing how a simulation can be terminated.

What (if any) arguments should be passed to the function?

None for this initial refactor (I'm aiming to minimize PR size). In the future, I can imagine passing it a clock (to support arbitrary clock sources and enforcement of a desired real-time factor), the simulation state (to detect convergence or live lock), etc.

Does there have to be some way to record the reason that the simulation was stopped?

I don't think it's necessary to inform the calling process of why the simulation stopped. Distinguishing between stopping due to SIGINT (ctrl+c) or the final simulation time being reached is not necessary, at least for the current Solver 1.0 implementation and applications. Maybe Solver 2.0 can be designed to support this differentiation if there is such a need.

jwnimmer-tri commented 8 years ago

I don't see why -- please explain. Putting it in Options has the advantage of leaving the API unchanged, and Solver 2 will handle this differently.

As I said upthread weeks ago, a default argument can be used to keep the existing call sites unchanged. By making it an argument, we get function overloading, which allows for different overloads, e.g., depending on what types the callback functor wants.

liangfok commented 8 years ago

I am OK making the functor a new parameter of simulate(...). @sherm1, should I proceed to do that?

sherm1 commented 8 years ago

I don't understand the overload argument; there is only one actual simulation loop, which will require a particular method signature to call in the while condition. That could be the signature stored in Options. I'm also wary of modifying the signature to add behavior tweaks since it isn't easily scalable if this is just one of many more features we'll want (likely, IMO). Options provides a reasonable loophole to collect a potentially bottomless pit of behavior tweaks.

However, I don't feel strongly about it so if you guys prefer the extra argument (with proper default) I can live with that.

liangfok commented 8 years ago

Yeah, I think the initial time (ti), final time (tf), initial system state (xi) should all go into SimulationOptions. This would simplify the simulate() method signature to be a nice and concise:

template <typename System>
void simulate(const System& sys, const SimulationOptions& options) {
  // ...
}

This refactor probably shouldn't go into a single PR. I just wanted to know your thoughts about that.

jwnimmer-tri commented 8 years ago

I don't understand the overload argument; there is only one actual simulation loop, which will require a particular method signature to call in the while condition.

Okay, that's a good point. I retract my suggestion.

sherm1 commented 8 years ago

OK, thanks @jwnimmer-tri. @liangfok, time and state will be collected into the Context in System 2.0 (not options) so we probably don't want to mess with them here. But please go ahead with adding a termination function to SimulationOptions.

liangfok commented 8 years ago

Great! OK starting on a PR with just a new termination function in SimulationOptions. Thanks!

RobotLocomotion / drake

Drake's simulate() Method Needs to Support Externally Specified Termination Condition #2192