ga4gh / task-execution-schemas

Apache License 2.0
80 stars 27 forks source link

Add CANCELING state to delete cloud resources #189

Closed MattMcL4475 closed 1 year ago

MattMcL4475 commented 1 year ago

Currently TES has a CANCELED state. However, cancelling a task means the associated cloud resources need to be deleted, which can take a few minutes to complete - much longer than is reasonable for an HTTP request timeout.

Proposal: Add a new TES state: CANCELING. When a caller calls the CancelTask operation, TES shall set the task's state to CANCELED if and only if all dependent/downstream resources have been deleted. If it cannot delete the downstream cloud resources during the HTTP request lifetime, it shall set the task state to CANCELING, until all downstream resources have been deleted.

Example: in Cromwell on Azure, a user can submit a workflow that runs one task. If a user cancels that workflow while the task is still running, currently, Cromwell will call CancelTask, and the TES state will be set to CANCELED. However, at this point, the Azure Batch job and pool have NOT been deleted yet, since this is done by a background thread running on the TES web server. This is problematic because Cromwell cannot observe whether TES' expensive resources have been deleted yet, which can leave the system in a broken and expensive state, with VMs sitting idle in Azure Batch. A control plane that is managing the Cromwell environment (such as Terra) can also therefore not observe whether the system has successfully completely cancelled all running workflows.

kellrott commented 1 year ago

Can you re-target the PR to develop-1.1? I'm trying to collect all changes into that, so I can run on a final merged PR with CI before merging into develop.

MattMcL4475 commented 1 year ago

Thank you @uniqueg and @kellrott for the quick feedback! Changed it to CANCELING and re-targeted develop-1.1

MattMcL4475 commented 1 year ago

Thanks @uniqueg ; @kellrott looks like it's ready to merge