Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure
MIT License
277 stars, 121 forks

pool add command never completes if corresponding job already exists with tasks ready to run #344

Closed ghost closed 3 years ago

ghost commented 4 years ago

There are some scenarios where we need to delete and re-create a pool in the middle of a job run. Ideally, we would like to do this without touching the job and its tasks (that is, without issuing manual or API commands against them), so we just delete the pool and add it again. This scenario works OK if we do it manually. However, the pool add command waits for all nodes to be in one of the following states: {idle, preempted, start_task_failed, unusable}. Since the job already exists and its tasks are ready to run, all nodes quickly transition into the running state, which is not in that set. This means that the pool add command will wait forever.
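To illustrate the wait condition described above, here is a minimal sketch. It is not Batch Shipyard's actual code: the function names and the `poll` callable are hypothetical, and only the set of states is taken from the report. It shows why a pool whose nodes go straight to running never satisfies the check.

```python
# States that, per the report above, `pool add` waits for.
READY_STATES = {"idle", "preempted", "start_task_failed", "unusable"}


def all_nodes_settled(node_states):
    """Return True only if every node is in one of READY_STATES."""
    return all(state in READY_STATES for state in node_states)


def wait_for_pool(poll, max_polls):
    """Poll node states up to max_polls times; return True if the pool settled.

    `poll` is a hypothetical callable returning the current list of node
    state strings. The poll limit is added here for illustration only so
    that the sketch terminates.
    """
    for _ in range(max_polls):
        if all_nodes_settled(poll()):
            return True
    return False
```

With `poll` always returning `["running", "running"]`, `wait_for_pool` exhausts every poll and returns `False`; without the illustrative `max_polls` limit, the loop would not end.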

alfpark commented 4 years ago

Can you please explain your scenario of needing to delete and re-create the pool in the middle of a job run?

ghost commented 4 years ago

We have long-running jobs with thousands of tasks. Sometimes in the middle of a job run, we get automated security/compliance emails saying that the OS or software on the nodes needs to be updated with the latest security fixes. We want to preserve job progress, so it is easiest to just delete the pool and re-create it.

alfpark commented 4 years ago

You may want to consider a different practice instead of deleting the pool underneath a running job or set of running jobs.

Shipyard has the ability to live migrate jobs/job schedules to another pool. I suggest investigating the following workflow:

  1. Spin up a new pool.
  2. Execute shipyard jobs migrate ... (see the usage docs for more info; specify either --requeue or --terminate).
  3. Delete the old pool once its tasks have vacated it (after the job has been migrated successfully).

You can augment the flow by setting an autoscaling policy on the old pool so that it automatically scales down to zero, then query whether the pool has any active nodes left and follow with a delete. Otherwise, you can use Shipyard to query the old pool for any running nodes and then delete it.
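The "query the old pool, then delete" step could be sketched roughly as follows. This is only an illustration under stated assumptions: `get_node_states` and `delete_pool` are hypothetical callables that you would back with whatever Shipyard or Azure Batch query you actually use, and the node state strings are assumed to include "running" for nodes that are still executing tasks.

```python
import time


def delete_when_drained(get_node_states, delete_pool,
                        poll_seconds=30.0, max_polls=120):
    """Delete the old pool once no node reports the "running" state.

    get_node_states: hypothetical callable returning a list of state strings
                     for the nodes in the old pool.
    delete_pool:     hypothetical callable that deletes the old pool.
    Returns True if the pool was deleted, False if nodes were still running
    after max_polls checks.
    """
    for _ in range(max_polls):
        if "running" not in get_node_states():
            delete_pool()
            return True
        # Tasks are still vacating the pool; wait before checking again.
        time.sleep(poll_seconds)
    return False
```

Returning `False` instead of deleting unconditionally is a deliberate choice in this sketch: if tasks have not vacated the old pool in time, it leaves the pool alone so that work in progress is not lost.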