ROSS-org / ROSS

Rensselaer's Optimistic Simulation System
http://ross-org.github.io
BSD 3-Clause "New" or "Revised" License

ROSS hangs when out of event memory #6

Closed · JohnPJenkins closed this issue 10 years ago

JohnPJenkins commented 10 years ago

Hi all,

When event memory is exhausted in optimistic mode, the message "WARNING: No free event buffers. Try increasing memory via the --extramem option" is printed to stdout, and memory reclamation is attempted by forcing a GVT update (tw-sched.c, lines 177-185). However, if no memory can be reclaimed, the program appears to enter an infinite loop of checking for free memory and running the GVT. Is there any way to detect this behavior and terminate the program?
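For concreteness, the loop in question behaves roughly like the following. This is a paraphrased, self-contained sketch, not the actual tw-sched.c source; free_event_count() and force_gvt() are illustrative stand-ins for ROSS internals (the real code calls tw_gvt_force_update()).

```c
#include <stdio.h>

/* Paraphrased sketch of the hang described above -- not the actual
 * tw-sched.c source.  free_event_count() and force_gvt() stand in
 * for ROSS internals. */
static int free_event_count(void) { return 0; }  /* permanently exhausted */
static void force_gvt(void) { /* fossil collection would happen here */ }

int main(void)
{
    while (free_event_count() == 0) {
        printf("WARNING: No free event buffers. "
               "Try increasing memory via the --extramem option\n");
        force_gvt();  /* if this never reclaims an event, we spin forever */
    }
    return 0;
}
```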

laprej commented 10 years ago

Not really. It's basically the Halting Problem. We can't guarantee that an event won't get freed by running tw_gvt_force_update() one more time, but after 10 iterations, it's looking doubtful... But you never know! At least we have the warning now. Back in my day we didn't even have that! :D I suppose instead of printing the warning you could call tw_exit()?

mmubarak commented 10 years ago

Actually, you don't always see a warning when the simulation runs out of event memory. Recently I ran into cases where I was getting unexpected simulation output without any warning message that it was running out of memory. Needless to say, I spent quite some time digging through my code to find what was wrong :-) Things got resolved once I increased the event memory.

So I think if there is a way we could ensure that the warning always appears, that would be useful in the debugging process.

JohnPJenkins commented 10 years ago

I wonder if there's a practical ceiling for no_free_event_buffers before ROSS should give up and tw_error out? 10? 20? 100?

W.r.t. halting, is the condition of all PEs making no progress or stalling between GVTs sufficient to give up / finalize? The two major possibilities here I can think of are: 1) no more events to process (not an error); 2) PE(s) with insufficient memory propagate inability to progress to all other PEs.
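Sketched concretely, the ceiling idea might look like the following. Every name here is a hypothetical stand-in; in real ROSS the bailout would presumably call tw_error(TW_LOC, ...) from inside the scheduler loop rather than exit() from a mock.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch of the ceiling suggested above: abort after N consecutive
 * forced GVTs that reclaim nothing.  MAX_FORCED_GVT_RETRIES,
 * free_event_count(), and force_gvt() are hypothetical stand-ins. */
#define MAX_FORCED_GVT_RETRIES 10

static int free_event_count(void) { return 0; }  /* still exhausted */
static void force_gvt(void) { /* fossil collection attempt */ }

int main(void)
{
    unsigned retries = 0;

    while (free_event_count() == 0) {
        if (++retries > MAX_FORCED_GVT_RETRIES) {
            fprintf(stderr, "fatal: no free event buffers after %u "
                    "forced GVTs; increase --extramem\n", retries - 1);
            exit(EXIT_FAILURE);   /* tw_error() in real ROSS */
        }
        force_gvt();
    }
    return 0;
}
```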

carothersc-zz commented 10 years ago

Thanks John;

So, memory management is a bit of a sticky issue in optimistic parallel event simulators. In particular, a model could have a small set of LPs that race ahead and effectively consume the available event memory, or even all the available system memory. Additionally, a model developer could do some bad things, like unknowingly broadcasting events to a large group of LPs; this causes a swell in the pending event population. If left unchecked, the model will exhaust all available event memory even when running only in serial mode.

The trick is that the model developer needs to have some sense of what their peak event memory needs are when executing on a single processor. For network models, you can get a sense of that from the average hop count you expect, coupled with the arrival rate of new packets.

Then you want to add just enough memory for efficient optimistic execution. This is typically no more than 8K to 16K event buffers, assuming batch and GVT interval values of 8 and 512 respectively. On average, between successive GVTs each MPI rank will process about batch × GVT-interval events, so at a batch of 8 and a GVT interval of 512, you'll process about 4K events per GVT epoch.
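To make the arithmetic concrete, here is a tiny self-contained program implementing the rule of thumb above. The per-epoch formula (batch × GVT interval) is from the comment; the 2x-4x cushion is an interpretation that maps the 4K-per-epoch figure onto the suggested 8K-16K range.

```c
#include <stdio.h>

/* Back-of-the-envelope event-buffer sizing from the rule of thumb
 * above.  Variable names are illustrative, not ROSS API. */
int main(void)
{
    unsigned batch        = 8;    /* --batch */
    unsigned gvt_interval = 512;  /* --gvt-interval */

    /* events each MPI rank processes between successive GVTs */
    unsigned per_epoch = batch * gvt_interval;  /* 8 * 512 = 4096, ~4K */

    printf("events per GVT epoch: %u\n", per_epoch);
    printf("extra event buffers for efficient optimistic execution: "
           "%u to %u\n", 2 * per_epoch, 4 * per_epoch);  /* 8K to 16K */
    return 0;
}
```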

Hope that helps, Chris

mmubarak commented 10 years ago

Not directly related, but one thing we could do is update the event memory allocation formula in codes-base (the codes-mapping API). Currently it's pretty static: it multiplies a constant value by the number of LPs per PE and allocates event memory accordingly. This works for some models but fails for others. Ideally, if we could calculate the event memory dynamically from the model parameters, that would help resolve some of the memory problems. Not something that has to be done right away, but we might want to do it at some point.
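A minimal sketch of the contrast being described, under stated assumptions: none of these names come from codes-base, and the fixed multiplier stands in for the "mem_factor" mentioned below.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch: today's static formula (fixed multiplier times
 * LPs per PE) versus a model-driven one.  All names are illustrative. */
static size_t static_extra_events(size_t lps_per_pe)
{
    const size_t mem_factor = 1024;  /* fixed, model-agnostic */
    return mem_factor * lps_per_pe;
}

static size_t dynamic_extra_events(size_t lps_per_pe,
                                   double peak_events_per_lp,
                                   double safety_factor)
{
    /* size from what the model is expected to generate instead */
    return (size_t)((double)lps_per_pe * peak_events_per_lp
                    * safety_factor);
}

int main(void)
{
    printf("static:  %zu buffers\n", static_extra_events(64));
    printf("dynamic: %zu buffers\n", dynamic_extra_events(64, 300.0, 1.5));
    return 0;
}
```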

JohnPJenkins commented 10 years ago

Thanks Chris! A lot of useful information in there. Perhaps your response would make for good content in the wiki?

Misbah: that's been a backburner item I haven't quite gotten around to, but is relatively easy to achieve. The only question is whether to have it as a run-time argument (i.e. passed through argv and required in all tw_opts) or as a configuration parameter (processed through the codes-config code path). I'm leaning towards the latter.

As you mention, making the "mem_factor" multiplier configurable makes it easier to play with available memory but does not eliminate the core (undecidable?) problem of hanging on out-of-event-memory conditions.

JohnPJenkins commented 10 years ago

As discussed in the meeting today, the hanging problem isn't something that can be solved, due to Time Warp semantics and ROSS's memory upper bounds. It might not be a bad idea to kill the simulation after a large number of successive failures, as Justin suggested, but I'll leave that up to you guys. Closing the issue...