Azure / hpcpack

The repo to track public issues for Microsoft HPC Pack product.
MIT License

Additional Preemption Options: Suspend and Share #30

Open ourcookiemonster opened 10 months ago

ourcookiemonster commented 10 months ago

Currently HPC Pack supports only immediate or graceful preemption. In immediate preemption, lower priority tasks (or jobs) get cancelled. In graceful preemption, lower priority tasks are allowed to complete.

In our cluster environment, compute is scarcer than memory. Our high priority jobs are time-sensitive, real-time and short running. Our lower priority jobs are not time-sensitive and are often very long running.

Immediate preemption is required to get the production tasks executed in a timely fashion but is suboptimal because the long running, low priority tasks lose their work. Graceful preemption would save the work of the long running tasks but the high priority production jobs would not run in time.

We would like a third preemption option: Suspension. In this preemption mode, the running lower priority tasks have their Windows Process Priority lowered to "Below normal" or "Low". The higher priority task can then be launched on that core. Note that, technically, I think all processes spawned by a task would need to be set to the lower priority. Once the high priority task completes, the low priority task's processes can have their Process Priority set back to "Normal".
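To make the "all processes spawned by a task" point concrete, here is a minimal Python sketch of the bookkeeping such a mode might need. It is only a model: the `Proc` record, `process_tree` and `set_tree_priority` are hypothetical names, not HPC Pack APIs, and priorities are plain strings. On a real Windows node the equivalent step would presumably go through the Win32 `SetPriorityClass` call for each process in the task's tree.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

NORMAL = "Normal"
BELOW_NORMAL = "Below normal"


@dataclass
class Proc:
    """Hypothetical record of one OS process on a compute node."""
    pid: int
    parent: Optional[int]   # pid of the parent process, None for a task root
    priority: str = NORMAL


def process_tree(procs: Dict[int, Proc], root_pid: int) -> List[int]:
    """Return root_pid plus the pid of every descendant, breadth-first."""
    children: Dict[Optional[int], List[int]] = {}
    for p in procs.values():
        children.setdefault(p.parent, []).append(p.pid)
    order: List[int] = []
    queue = [root_pid]
    while queue:
        pid = queue.pop(0)
        order.append(pid)
        queue.extend(children.get(pid, []))
    return order


def set_tree_priority(procs: Dict[int, Proc], root_pid: int, priority: str) -> List[int]:
    """Apply one priority to a task's root process and all of its descendants."""
    changed = process_tree(procs, root_pid)
    for pid in changed:
        procs[pid].priority = priority
    return changed


if __name__ == "__main__":
    # Task rooted at pid 100 spawned 101, which spawned 102; pid 200 is another task.
    procs = {
        100: Proc(100, None),
        101: Proc(101, 100),
        102: Proc(102, 101),
        200: Proc(200, None),
    }
    set_tree_priority(procs, 100, BELOW_NORMAL)   # high priority task arrives
    set_tree_priority(procs, 100, NORMAL)         # high priority task completes
```

The sketch lowers and restores the whole tree in one call, so a child process left at "Normal" cannot keep competing with the high priority task.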

Provided there is enough memory for both tasks to be alive simultaneously, this should solve our scheduling problem in a much nicer way than either the Immediate or Graceful options.

In fact, even just letting both tasks run simultaneously and letting the OS share the core equally between the low and high priority task is preferable to Immediate or Graceful. This could be a fourth preemption option: Share. In this preemption mode, a core will permit a new task to join it if the new task's Job priority is higher than the Job priority of the task(s) currently running on that core.
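The admission rule for Share could be sketched roughly as follows, in Python. This is an illustration, not scheduler code: `can_share_core` is a hypothetical name, priorities are assumed to be plain integers where larger means more important, and "any" is read here as "every task already on the core". An idle core is assumed to accept any task.

```python
from typing import Iterable


def can_share_core(new_job_priority: int, running_job_priorities: Iterable[int]) -> bool:
    """Return True if the new task may join the core under the Share rule.

    The new task's job priority must be strictly higher than the job
    priority of every task already running on the core. An empty core
    admits any task.
    """
    return all(new_job_priority > p for p in running_job_priorities)


if __name__ == "__main__":
    print(can_share_core(3000, [1000]))        # high priority job vs. one low priority task
    print(can_share_core(1000, [1000]))        # equal priority
    print(can_share_core(2000, [1000, 3000]))  # core already shared with a higher priority task
```

Requiring a strictly higher priority keeps jobs of equal priority from piling onto the same core, which matches the comparison written in the request.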

This type of preemption can also be very useful for running maintenance tasks across all nodes.

Please let us know if/when this feature could be implemented. I believe a similar feature exists in other cluster scheduling tools such as Slurm.