SCALE-MS / scale-ms

SCALE-MS design and development
GNU Lesser General Public License v2.1

Long-running processes may need monitoring and pruning #341

Open eirrgang opened 1 year ago

eirrgang commented 1 year ago

@wehs7661 points out that the software used to execute tasks may have memory leaks that could drag down a Worker over time. For instance, libgromacs may leak as much as a few MB per simulation (generally due to the historical assumption that a single process would not run more than one simulation, so long-lived data structures and once-per-simulation function calls have been low-priority modernization targets).

In order to accommodate workflows with many thousands of instances of similar tasks (such as iterations of simulators), we should be ready to periodically shut down and respawn Worker tasks.

In a first implementation, we can just have a max_reuse parameter or something similar. In a more refined implementation, we could monitor Workers to see whether their memory footprints are growing steadily or whether they are taking longer to complete similar work, and apply some heuristic to say "okay, this is going to be your last task, and then you should stop".
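
For illustration, a minimal sketch of what such a policy could look like on the Worker side, assuming we track reuse count and resident memory per process. `ReusePolicy`, `max_reuse`, and the RSS-growth threshold are illustrative names here, not existing SCALE-MS or RADICAL-Pilot parameters:

```python
# Illustrative sketch only: neither ReusePolicy nor max_reuse exists in
# SCALE-MS or RADICAL-Pilot; this just shows the shape of the heuristic.
import os

import psutil


class ReusePolicy:
    """Decide when a worker process has been reused enough."""

    def __init__(self, max_reuse=1000, max_rss_growth=0.5):
        self._proc = psutil.Process(os.getpid())
        self._rss0 = self._proc.memory_info().rss
        self._max_reuse = max_reuse
        self._max_rss_growth = max_rss_growth
        self._completed = 0

    def should_retire(self):
        """Call after each task: True if this should be the worker's last task."""
        self._completed += 1
        rss = self._proc.memory_info().rss
        leaked_too_much = (rss - self._rss0) / self._rss0 > self._max_rss_growth
        return self._completed >= self._max_reuse or leaked_too_much
```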

@andre-merzky : can you comment on whether it would be easier/better for a Worker to shut itself down based on internal logic vs. the Master stopping specific Workers (or, I suppose, all Workers) with the help of some sort of polling or reporting from the Worker(s)?

My expectation is that it would be reasonable for a Worker to unregister itself, triggering a status change through Master.worker_state_cb() (WARNING: the expected state values are not documented). Then the Master can launch a new Worker, if necessary, while the Worker finishes its current Task, publishes results (triggering result_cb() to occur after the unregistration), and shuts itself down.
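
As a rough sketch of the Master-side reaction under that expectation (hedged heavily: the worker_state_cb() signature, the 'DONE' state string, the submit_workers() call, and the stored worker description are all assumptions, since the expected state values are not documented):

```python
# Sketch only: the callback signature, the 'DONE' state value, and
# self._worker_description are placeholders, not documented API behavior.
import radical.pilot as rp


class RecyclingMaster(rp.raptor.Master):

    def worker_state_cb(self, worker, state):
        # When a Worker unregisters/finishes on its own, launch a
        # replacement so the pool size stays constant.
        if state == 'DONE':
            # Assumed: submit_workers() accepts a list of worker descriptions.
            self.submit_workers([self._worker_description])
```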

andre-merzky commented 1 year ago

@andre-merzky : can you comment on whether it would be easier/better for a Worker to shut itself down based on internal logic vs. the Master stopping specific Workers (or, I suppose, all Workers) with the help of some sort of polling or reporting from the Worker(s)?

I would suggest letting the worker decide about its own termination, indeed. It then has a chance to drain its own work queue before terminating, so no tasks get terminated along with it. The master does not know whether the worker is active on any tasks, so if the master kills a worker, additional bookkeeping would be needed to detect which tasks were killed along with it.

eirrgang commented 1 year ago

I would suggest letting the worker decide about its own termination, indeed.

What should the Worker do to unregister and stop accepting new tasks?

What signals/states will the Master see on worker_state_cb() at which points as the Worker stops accepting tasks and shuts itself down?

andre-merzky commented 1 year ago

We don't have a public API for that functionality at the moment. It boils down to stopping the queue-listening threads and then draining the internal work queue. It needs some implementation work to expose that cleanly in the worker class.
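
To give that a concrete shape, a purely hypothetical sketch of what such a Worker-side API could eventually look like (none of these method names exist in radical.pilot today):

```python
# Hypothetical only: none of these methods exist in radical.pilot today;
# this just names the steps described above, in the order they would happen.
class DrainableWorker:

    def retire(self):
        self._stop_listening()      # 1. stop the queue-listening threads
        self._drain_work_queue()    # 2. finish tasks already accepted
        self._unregister()          # 3. tell the Master this worker is leaving
        self._terminate()           # 4. shut the process down

    # Stubs standing in for internals that would need to be implemented:
    def _stop_listening(self):      ...
    def _drain_work_queue(self):    ...
    def _unregister(self):          ...
    def _terminate(self):           ...
```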

What's the priority on this? My understanding is that this is a 'nice-to-have' in case the memory buildup turns out to be serious?

eirrgang commented 1 year ago

What's the priority on this? My understanding is that this is a 'nice-to-have' in case the memory buildup turns out to be serious?

It turns out that @wehs7661 cannot achieve useful research results without a number of simulations that exceeds, by several orders of magnitude, the number of times a process can usefully be reused, and the performance degradation with libgromacs process reuse is not likely to be resolved on the SCALE-MS project time scale.

Refinement of the monitoring and "smart" recycling will be nice to have, but avoiding indefinite reuse of a Worker process is high priority, at least when libgromacs is involved.

andre-merzky commented 1 year ago

The easiest option might be for a master to terminate all workers after N tasks and to restart them. We have the mechanism for that in place: the master can do the task counting, and it can also delay the submission of received tasks to drain workers before recycling them. What do you think?
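
A sketch of what that could look like on the Master side, assuming request_cb() can intercept incoming tasks and return the subset to submit; the counter, the threshold, and the _recycle_workers_when_drained() helper are illustrative, not existing API:

```python
# Sketch only: RECYCLE_AFTER, the held-task list, and
# _recycle_workers_when_drained() are illustrative, not existing API.
import radical.pilot as rp


class CountingMaster(rp.raptor.Master):

    RECYCLE_AFTER = 1000  # tasks between worker restarts (illustrative)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._count = 0
        self._held = []   # tasks held back while the workers drain

    def request_cb(self, tasks):
        self._count += len(tasks)
        if self._count >= self.RECYCLE_AFTER:
            # Hold new work, let the current workers finish what they have,
            # then tear them down and start fresh ones before resubmitting.
            self._held.extend(tasks)
            self._recycle_workers_when_drained()
            return []
        return tasks

    def _recycle_workers_when_drained(self):
        # Hypothetical helper: wait for the current workers to go idle,
        # stop them, submit fresh ones, then resubmit self._held and
        # reset self._count.
        ...
```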

eirrgang commented 1 year ago

That sounds like it should work. Presumably, we could submit a new Master with a new raptor ID and get it spinning up and start routing tasks to it even as the previous one is coming down.

Or do you propose to completely quiesce the raptor resources periodically, with a sort of workflow barrier from within the Master? I think it might be hard to maintain reasonable occupancy of the Pilot job in that case.

andre-merzky commented 1 year ago

get it spinning up and start routing tasks to it even as the previous one is coming down.

Well, the master would not start as long as the other master and its workers hog all available resources.

Or do you propose to completely quiesce the raptor resources periodically, with a sort of workflow barrier from within the Master? I think it might be hard to maintain reasonable occupancy of the Pilot job in that case.

I don't think you can enforce worker teardown / restart w/o some loss of resource utilization. The amount of that loss depends on task runtime (longer tasks -> longer time to drain) and frequency of required restarts.
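
For a rough sense of scale, a back-of-envelope estimate under simplifying assumptions (one task in flight per worker, fixed task runtime, fixed cost per drain/respawn cycle); the numbers are illustrative only:

```python
def restart_overhead(task_runtime_s, tasks_per_restart, restart_cost_s):
    """Fraction of wall time a worker spends draining/restarting instead of working."""
    productive = tasks_per_restart * task_runtime_s
    return restart_cost_s / (productive + restart_cost_s)

# e.g. 10-minute tasks, recycled every 100 tasks, 2 minutes to drain and respawn:
# restart_overhead(600, 100, 120) -> ~0.002, i.e. about 0.2% of the allocation
```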

One more question: at the moment, we are not running Gromacs functions but Gromacs executables (IIUC). As such, memory should actually not leak. So I would assume that this is not required for the current set of experiments, is that correct?

eirrgang commented 1 year ago

at the moment, we are not running Gromacs functions but Gromacs executables (IIUC)

We are doing both. We have been trying to work towards more function-based gromacs access, but we keep finding gotchas like this.

It's also worth noting that another potential mitigation is through long-intended but long-delayed updates to gmxapi that would allow simulators to be reused with fresh inputs without reinitializing all of libgromacs. I think that is now beyond the scope of the current projects, though.

peterkasson commented 1 year ago

Just $.02 here: we don't want the solution to be "master blocks until N tasks have completed on all workers, then shuts down/restarts"... that would get us into the exponential-tail problem. If we have an essentially non-blocking way to tear down and restart from checkpoint, that would work? (e.g. keep assigning work and processing it until the condition is met, then tear down, restart, and keep going)

eirrgang commented 1 year ago

I think, generally, we will have Workers that are provisioned to accommodate a single simulation task at a time. The simulation tasks have the most important effect on the Worker resource requirements and are likely to be the "biggest" tasks.

So draining the queue shouldn't be much of a problem. There should only be one simulation task per worker, generally.

We will have to keep in mind that we have the potential to waste resources in situations like this, of course.

We should also acknowledge that we don't want to get stuck with this tail problem, and we should have a medium-term plan that allows individual workers to be removed (or remove themselves) from the pool and new workers to be added. Until then, we just have to try to identify the approach that wastes the fewest resources.