Closed MattMcL4475 closed 1 year ago
Can you re-target the PR to develop-1.1? I'm trying to collect all changes into that, so I can run on a final merged PR with CI before merging into develop.
Thank you @uniqueg and @kellrott for the quick feedback! Changed it to CANCELING
and re-targeted develop-1.1
Thanks @uniqueg ; @kellrott looks like it's ready to merge
Currently TES has a
CANCELED
state. However, cancelling a task means the associated cloud resources need to be deleted, which can take a few minutes to complete - much longer than is reasonable for an HTTP request timeout.Proposal: Add a new TES state:
CANCELING
. When a caller calls theCancelTask
operation, TES shall set the task's state toCANCELED
if and only if all dependent/downstream resources have been deleted. If it cannot delete the downstream cloud resources during the HTTP request lifetime, it shall set the task state toCANCELING
, until all downstream resources have been deleted.Example: in Cromwell on Azure, a user can submit a workflow that runs one task. If a user cancels that workflow while the task is still running, currently, Cromwell will call
CancelTask
, and the TES state will be set toCANCELED
. However, at this point, the Azure Batch job and pool have NOT been deleted yet, since this is done by a background thread running on the TES web server. This is problematic because Cromwell cannot observe whether TES' expensive resources have been deleted yet, which can leave the system in a broken and expensive state, with VMs sitting idle in Azure Batch. A control plane that is managing the Cromwell environment (such as Terra) can also therefore not observe whether the system has successfully completely cancelled all running workflows.