ga4gh / task-execution-schemas

Feature: provide a mechanism to inform the caller that a quota is exceeded #200

Open MattMcL4475 opened 8 months ago

MattMcL4475 commented 8 months ago

Currently, when Azure Batch has no quota available, a TES Task in TES on Azure will stay in the INITIALIZING state indefinitely until quota becomes available. This could be minutes, hours, or even days. TES needs a way to inform the caller why this is the case, so that the caller can update the UI with this additional information, and the user or IT admin knows they need to submit an Azure Support Request to increase their quota. Otherwise, they don't have visibility into why the task is not progressing.

Ideally, we want the caller to be able to recognize, from a string such as "Pending available quota: low-priority vCPUs", that there is a quota issue and which specific quota is affected (here, low-priority vCPUs). I'm also open to the idea of adding a specific string property to the TES Task, such as quotaTypeExceeded, set to a value such as "low-priority vCPUs" or "NVSv3 Series".
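To make the two options concrete, here is a rough Python sketch of what a caller might do. It is only an illustration: quotaTypeExceeded is the property proposed above and does not exist in the TES schema today, and the function name quota_type_exceeded, the task-as-dict shape, and the idea of passing in a separate list of message strings are all assumptions made for the example.

```python
import re
from typing import List, Optional

# Message format taken from the example string in this issue.
QUOTA_MESSAGE = re.compile(r"Pending available quota:\s*(?P<quota>.+)")


def quota_type_exceeded(task: dict, messages: List[str]) -> Optional[str]:
    """Return the exceeded quota type, or None if no quota issue is reported.

    `task` is a TES task as a plain dict. `messages` is whatever free-text
    status strings the caller has for that task (hypothetical; where they
    come from is not defined by this sketch).
    """
    # Option 1: the proposed dedicated property (not part of TES today).
    explicit = task.get("quotaTypeExceeded")
    if explicit:
        return explicit

    # Option 2: fall back to parsing the free-text message.
    for message in messages:
        match = QUOTA_MESSAGE.search(message)
        if match:
            return match.group("quota").strip()
    return None
```

With a dedicated property the caller never has to depend on the exact wording of the message, which is the main reason it seems preferable to string parsing.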

MattMcL4475 commented 8 months ago

@patmagee what are your thoughts on the best way to handle this?

patmagee commented 8 months ago

@MattMcL4475 I would be in favour of a fail-fast model. I think waiting for quota for a given period of time is okay, but beyond a "reasonable limit" the TES task should fail.

I propose a new state, INSUFFICIENT_RESOURCES, to be treated as a failure state. If returned, it would tell the end user clearly why something failed. Combined with a message describing the failure, that should allow diagnosis of most failures.
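A minimal Python sketch of how this fail-fast behaviour could look, under stated assumptions: INSUFFICIENT_RESOURCES is the state proposed here and is not in the TES schema, the function names and the message wording are made up for illustration, and the "reasonable limit" is left as a parameter because no value has been agreed.

```python
from typing import Optional, Tuple

# EXECUTOR_ERROR and SYSTEM_ERROR are existing TES failure states;
# INSUFFICIENT_RESOURCES is the state proposed in this comment.
FAILURE_STATES = {"EXECUTOR_ERROR", "SYSTEM_ERROR", "INSUFFICIENT_RESOURCES"}


def state_while_waiting(
    waited_seconds: float, limit_seconds: float, quota_type: str
) -> Tuple[str, Optional[str]]:
    """Server side: decide the task state while quota is unavailable."""
    if waited_seconds <= limit_seconds:
        # Still within the allowed wait: keep the task pending.
        return "INITIALIZING", None
    # Limit exceeded: fail the task with an explanatory message.
    message = (
        f"Quota exceeded: {quota_type} "
        f"(waited {int(waited_seconds)}s, limit {int(limit_seconds)}s)"
    )
    return "INSUFFICIENT_RESOURCES", message


def is_failure(state: str) -> bool:
    """Caller side: treat the proposed state like any other failure state."""
    return state in FAILURE_STATES
```

A caller that already handles failure states would then only need to add the new state to its failure set and surface the accompanying message to the user.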

patmagee commented 8 months ago

I think this state would fit well onto a WES workflow as well btw