This is a good idea.
I suspect it might not be quite as simple as just grouping together messages based on time, given that outside factors might be at play. We'll talk this through at our next regular sync.
We should look into publishing events that contain details about these errors to the Kubernetes event log for this resource.
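For concreteness, here's a minimal sketch of how a controller-runtime-based controller can publish such an event; the helper name and reason string are illustrative, not ASO's actual code:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// emitProvisioningError is a hypothetical helper (not ASO's actual code):
// it records a Warning event against the resource so the failure detail
// shows up in `kubectl describe` and the Kubernetes event log.
func emitProvisioningError(recorder record.EventRecorder, obj runtime.Object, err error) {
	// record.EventRecorder.Event(object, eventtype, reason, message) is the
	// standard client-go API; controller-runtime hands out a recorder via
	// Manager.GetEventRecorderFor.
	recorder.Event(obj, corev1.EventTypeWarning, "CreateOrUpdateActionError", err.Error())
}
```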
@mehighlow have you seen this problem where the "real" error is obscured by a subsequent retry for other resources? What other resource types? It might be worth us talking to the CosmosDB team and asking them if they could include the "cause of failure" in this error message as well, as it would lift all boats (ASO, Terraform, etc.), all of which may re-apply a resource to retry it.
I've looked into this some. As far as I know, Kubernetes groups events only if their text is exactly the same. In cases like this, where the errors contain activity IDs or other dynamically generated fields (timestamps), they won't be grouped in the event viewer. We could redact those fields to make the events uniform and groupable, but if we do that the events are less useful, as they no longer contain the IDs you could take to Azure support.
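To make that trade-off concrete, here's a rough sketch (not ASO code; the patterns and placeholder strings are assumptions) of what such a redaction pass could look like:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical redaction pass: strip the dynamically generated fields
// (GUIDs such as ActivityId / Operation Id, and RFC 1123 timestamps) so
// that repeated failures produce byte-identical messages Kubernetes can
// aggregate. The downside noted above applies: the redacted text no
// longer carries IDs you could hand to Azure support.
var (
	guidRe      = regexp.MustCompile(`[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`)
	timestampRe = regexp.MustCompile(`\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} GMT`)
)

func redact(msg string) string {
	msg = guidRe.ReplaceAllString(msg, "<guid>")
	return timestampRe.ReplaceAllString(msg, "<timestamp>")
}

func main() {
	fmt.Println(redact("Operation Id: 1403e78b-ab8a-4bb3-85b4-285aab8a96f8, at Thu, 04 Apr 2024 17:35:21 GMT"))
	// Output: Operation Id: <guid>, at <timestamp>
}
```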
I think the right fix here is to just mark the initial error:
```
Warning CreateOrUpdateActionError 29m DatabaseAccountController Reason: ServiceUnavailable, Severity: Warning, RetryClassification: RetrySlow, Cause: Database account creation failed. Operation Id: 1403e78b-ab8a-4bb3-85b4-285aab8a96f8, Error : Message: {"code":"ServiceUnavailable","message":"Sorry, we are currently experiencing high demand in West US 3 region, and cannot fulfill your request at this time Thu, 04 Apr 2024 17:35:21 GMT. To request region access for your subscription, please follow this link https://aka.ms/cosmosdbquota for more details on how to create a region access request.\r\nActivityId: 8c3213a5-c570-4a8a-b337-e05de991eb9e, Microsoft.Azure.Documents.Common/2.14.0"}
```
as fatal.
That way we don't retry and get the less-helpful error.
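Conceptually, the fix amounts to something like the sketch below; the type and function names are made up for illustration, and the real classification logic lives in the actual change:

```go
package example

import "strings"

// Classification is a stand-in for however the reconciler distinguishes
// retryable errors from terminal ones; both names are illustrative.
type Classification int

const (
	Retry Classification = iota
	Fatal
)

// classifyCreateError sketches the idea behind the fix: when the initial
// create fails with this capacity-style ServiceUnavailable error, treat
// it as fatal rather than retrying, so the informative first error is
// the one that stays on the resource.
func classifyCreateError(code, message string) Classification {
	if code == "ServiceUnavailable" && strings.Contains(message, "currently experiencing high demand") {
		return Fatal
	}
	return Retry
}
```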
This results in:
```
NAME READY SEVERITY REASON MESSAGE
matthchr-db-acct False Error ServiceUnavailable Database account creation failed. Operation Id: 768740a2-a70f-4dd0-8c8e-e4fe75ed0bf1, Error : Message: {"code":"ServiceUnavailable","message":"Sorry, we are currently experiencing high demand in West US 3 region, and cannot fulfill your request at this time Thu, 04 Apr 2024 18:20:43 GMT. To request region access for your subscription, please follow this link https://aka.ms/cosmosdbquota for more details on how to create a region access request.\r\nActivityId: 901bb9c4-21e5-43ad-b4ab-17a24cdcee47, Microsoft.Azure.Documents.Common/2.14.0"}, Request URI: /serviceReservation, RequestStats: , SDK: Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0: GET https://management.azure.com/subscriptions/4ef44fef-c51d-4d7c-a6ff-8635c02848b1/providers/Microsoft.DocumentDB/locations/westus3/operationsStatus/07c29785-548d-40d3-bb8d-a925ed207f11...
```
which I think is more helpful than what we were doing before. That change is in PR #3906.
@mehighlow if you know of other resources where you've seen similar behavior, please let us know; we may be able to improve their experiences as well.
@matthchr, you bet! Thanks for the improvement!
**Describe the current behavior**
When a resource creation fails, ASO exposes the last status Azure provides, which might not be informative.
E.g. CosmosDB provisioning failed due to capacity constraints.
This message provides clear instructions for re-creating the DatabaseAccount, while an earlier message in close proximity exposes a different reason for the failure.
**Describe the improvement**
Add an event category so that events occurring within a ~5-minute window can be grouped, and extend status messages to provide more detail.
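A rough sketch of the proposed time-window grouping (the `Event` type, grouping key, and window value are illustrative assumptions, not existing ASO behavior):

```go
package example

import "time"

// Event is a minimal stand-in for a recorded status event.
type Event struct {
	Reason  string
	Message string
	Time    time.Time
}

// groupByWindow sketches the proposed grouping: consecutive events that
// share a reason and fall within `window` of the previous event in the
// group are bucketed together. Events are assumed to be sorted by time.
func groupByWindow(events []Event, window time.Duration) [][]Event {
	var groups [][]Event
	for _, e := range events {
		if n := len(groups); n > 0 {
			last := groups[n-1]
			if last[0].Reason == e.Reason && e.Time.Sub(last[len(last)-1].Time) <= window {
				groups[n-1] = append(last, e)
				continue
			}
		}
		groups = append(groups, []Event{e})
	}
	return groups
}
```

Called with `window = 5*time.Minute`, repeated copies of the CosmosDB error above would collapse into one group while unrelated events stay separate.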
**Additional context**
I've encountered these issues not only with CosmosDB but also with other Azure resource types.