Can you please explain your scenario of needing to delete and re-create the pool in the middle of a job run?
We have long-running jobs with thousands of tasks. Sometimes, in the middle of a job run, we get automated security/compliance emails saying that the OS/software on the nodes needs to be updated with the latest security fixes. We want to preserve job progress, so the easiest option is to just delete the pool and re-create it.
You may want to consider a different practice instead of deleting the pool underneath a running job or set of running jobs.
Shipyard has the ability to live migrate jobs/job schedules to another pool. Suggest investigating the following workflow:
shipyard jobs migrate ... (see the usage docs for more info, specify either --requeue or --terminate)
You can augment the flow by setting autoscaling policies on the old pool to automatically scale down to zero and query if the pool has no active nodes in it to follow with a delete. Otherwise you can use Shipyard to query the old pool for any running nodes and delete.
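The "scale down, check for active nodes, then delete" step could be scripted roughly as below. This is only a sketch: `count_active_nodes` and `delete_pool` are hypothetical callables that you would back with whatever Shipyard or Azure Batch query you actually use, and the polling interval and retry limit are arbitrary assumptions.

```python
import time
from typing import Callable


def delete_pool_when_drained(
    count_active_nodes: Callable[[], int],  # hypothetical: returns nodes still in the old pool
    delete_pool: Callable[[], None],        # hypothetical: issues the pool delete
    poll_interval: float = 30.0,            # assumed value, tune as needed
    max_polls: int = 120,                   # assumed value, tune as needed
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the old pool and delete it once it has no active nodes.

    Returns True if the pool was deleted, False if it never drained
    within max_polls checks.
    """
    for _ in range(max_polls):
        if count_active_nodes() == 0:
            delete_pool()
            return True
        sleep(poll_interval)
    return False
```

Injecting the two callables keeps the loop independent of any particular client library, so the same logic works whether the node count comes from the CLI or from an API call.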
There are some scenarios where we need to delete and re-create a pool in the middle of a job run. Ideally, we would like to do this without touching (issuing manual or API commands against) the job and its tasks, so we just delete the pool and add it again. This works fine if we do it manually. However, the pool add command waits for all nodes to be in one of the following states: {idle, preempted, start_task_failed, unusable}. Since the job already exists and its tasks are ready to run, all nodes quickly transition into the running state, which means the pool add command will wait forever.
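To make the hang concrete, here is a minimal sketch of the kind of readiness check described above. It is not Shipyard's actual implementation; the function name and the lowercase state strings are assumptions for illustration, and only the set of accepted states is taken from the description.

```python
from typing import Iterable

# States the pool add command is described as waiting for.
POOL_ADD_WAIT_STATES = frozenset(
    {"idle", "preempted", "start_task_failed", "unusable"}
)


def all_nodes_settled(
    node_states: Iterable[str],
    accepted: frozenset = POOL_ADD_WAIT_STATES,
) -> bool:
    """Return True only if every node is in one of the accepted states."""
    states = list(node_states)
    return bool(states) and all(state in accepted for state in states)


# A fresh pool with no pending work settles as expected.
fresh_pool = ["idle", "idle", "idle"]

# A re-created pool under an existing job: nodes pick up tasks immediately.
busy_pool = ["running", "running", "running"]

print(all_nodes_settled(fresh_pool))
print(all_nodes_settled(busy_pool))
print(all_nodes_settled(busy_pool, POOL_ADD_WAIT_STATES | {"running"}))
```

With the described state set, the check never passes for the busy pool, so a loop built on it would not terminate. Treating running as an acceptable state, as in the last call, is one hypothetical way such a wait could finish.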