Open MattMcL4475 opened 8 months ago
@patmagee what are your thoughts on the best way to handle this?
@MattMcL4475 I would be in favour of a fail-fast model. I think waiting for quota for a given period of time is okay, but beyond a "reasonable limit" the TES task should fail.
I propose a new state, `INSUFFICIENT_RESOURCES`, to be treated as a failure state. If returned, it would tell the end user clearly why the task failed. Combined with a failure message, that should allow diagnosis of most failures.
I think this state would fit well onto a WES workflow as well btw
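To make the fail-fast idea concrete, here is a minimal Python sketch of how a server might decide between continuing to wait and failing the task. Everything in it is an assumption for illustration: `INSUFFICIENT_RESOURCES` is only the state proposed in this thread (it is not in the published TES state enum), and `resolve_state` and `QUOTA_WAIT_LIMIT` are hypothetical names. The thread does not fix a value for the "reasonable limit".

```python
from datetime import datetime, timedelta
from typing import Tuple

# Existing TES state in which the task currently waits for quota.
INITIALIZING = "INITIALIZING"
# State proposed in this thread; NOT part of the published TES state enum.
INSUFFICIENT_RESOURCES = "INSUFFICIENT_RESOURCES"

# Hypothetical "reasonable limit"; the thread does not specify a value.
QUOTA_WAIT_LIMIT = timedelta(hours=1)


def resolve_state(
    waiting_since: datetime,
    now: datetime,
    quota_name: str,
    limit: timedelta = QUOTA_WAIT_LIMIT,
) -> Tuple[str, str]:
    """Return (state, message) for a task that is blocked on quota."""
    if now - waiting_since > limit:
        # Fail fast: surface a failure state plus a diagnostic message.
        return (
            INSUFFICIENT_RESOURCES,
            f"Quota not available within the wait limit: {quota_name}",
        )
    # Still within the limit: keep waiting and report why.
    return INITIALIZING, f"Pending available quota: {quota_name}"
```

A caller would poll the task as usual and treat `INSUFFICIENT_RESOURCES` like any other terminal failure state, showing the accompanying message to the user.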
Currently, when Azure Batch has no quota available, a TES Task in TES on Azure will stay in the `INITIALIZING` state indefinitely until quota becomes available. This could be minutes, hours, or even days. TES needs a way to inform the caller why this is the case, so that the caller can update the UI with this additional information, and the user or IT admin knows they need to submit an Azure Support Request to increase their quota. Otherwise, they don't have visibility into why the task is not progressing.
Ideally we actually want the caller to parse the string `Pending available quota: low-priority vCPUs`, to recognize that there is a quota issue, and that the specific quota is `low-priority vCPUs`, so I'm also open to the idea of adding a specific `string` property to the TES Task such as `quotaTypeExceeded`, and set it to a value such as `low-priority vCPUs` or `NVSv3 Series`.
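As a rough illustration of the two options, the Python sketch below shows a caller that prefers a structured `quotaTypeExceeded` property and falls back to parsing the `Pending available quota: ...` string. The property is only the one proposed here, not an existing TES field. The helper names are hypothetical, and where the status message would actually appear in a TES response is left open, so it is passed in as a plain string.

```python
import re
from typing import Optional

# Matches the message format quoted in this issue.
QUOTA_MESSAGE = re.compile(r"^Pending available quota:(?P<quota>.*)$")


def quota_from_message(message: str) -> Optional[str]:
    """Extract the quota name from a status message, or None if absent."""
    match = QUOTA_MESSAGE.match(message.strip())
    if match is None:
        return None
    # An empty quota name is treated the same as no match.
    return match.group("quota").strip() or None


def quota_type_exceeded(task: dict, status_message: str = "") -> Optional[str]:
    """Prefer the proposed structured property; fall back to string parsing."""
    # "quotaTypeExceeded" is the property proposed in this issue (assumption).
    explicit = task.get("quotaTypeExceeded")
    if explicit:
        return explicit
    return quota_from_message(status_message)
```

With the structured property the caller never depends on the exact wording of the message, which is the main argument for adding it.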