This is a good idea.
I suspect it might not be quite as simple as just grouping together messages based on time, given that outside factors might be at play. We'll talk this through at our next regular sync.
We should look into publishing events that contain details about these errors to the Kubernetes event log for this resource.
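For concreteness, here's a minimal sketch of how a controller-runtime-based controller can publish such an event; the helper name and reason string are illustrative, not ASO's actual code:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// emitProvisioningError is a hypothetical helper (not ASO's actual code):
// it records a Warning event against the resource so the failure detail
// shows up in `kubectl describe` and the Kubernetes event log.
func emitProvisioningError(recorder record.EventRecorder, obj runtime.Object, err error) {
	// record.EventRecorder.Event(object, eventtype, reason, message) is the
	// standard client-go API; controller-runtime hands out a recorder via
	// Manager.GetEventRecorderFor.
	recorder.Event(obj, corev1.EventTypeWarning, "CreateOrUpdateActionError", err.Error())
}
```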
@mehighlow have you seen this problem where the "real" error is obscured by a subsequent retry for other resources? What other resource types? It might be worth us talking to the CosmosDB team and asking them if they could include the "cause of failure" in this error message as well, as it would lift all boats (ASO, Terraform, etc.), all of which may re-apply a resource to retry it.
I've looked into this some. As far as I know, Kubernetes groups events only if their text is exactly the same. In cases like this, where the errors contain activity IDs or other dynamically generated fields (timestamps), they won't be grouped in the event viewer. We could redact those fields to make the events uniform and groupable, but if we do that the events are less useful, as they no longer contain the IDs you could take to Azure support.
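To make that trade-off concrete, here's a rough sketch (not ASO code; the patterns and placeholder strings are assumptions) of what such a redaction pass could look like:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical redaction pass: strip the dynamically generated fields
// (GUIDs such as ActivityId / Operation Id, and RFC 1123 timestamps) so
// that repeated failures produce byte-identical messages Kubernetes can
// aggregate. The downside noted above applies: the redacted text no
// longer carries IDs you could hand to Azure support.
var (
	guidRe      = regexp.MustCompile(`[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`)
	timestampRe = regexp.MustCompile(`\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} GMT`)
)

func redact(msg string) string {
	msg = guidRe.ReplaceAllString(msg, "<guid>")
	return timestampRe.ReplaceAllString(msg, "<timestamp>")
}

func main() {
	fmt.Println(redact("Operation Id: 1403e78b-ab8a-4bb3-85b4-285aab8a96f8, at Thu, 04 Apr 2024 17:35:21 GMT"))
	// Output: Operation Id: <guid>, at <timestamp>
}
```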
I think the right fix here is to just mark the initial error:
```
Warning CreateOrUpdateActionError 29m DatabaseAccountController Reason: ServiceUnavailable, Severity: Warning, RetryClassification: RetrySlow, Cause: Database account creation failed. Operation Id: 1403e78b-ab8a-4bb3-85b4-285aab8a96f8, Error : Message: {"code":"ServiceUnavailable","message":"Sorry, we are currently experiencing high demand in West US 3 region, and cannot fulfill your request at this time Thu, 04 Apr 2024 17:35:21 GMT. To request region access for your subscription, please follow this link https://aka.ms/cosmosdbquota for more details on how to create a region access request.\r\nActivityId: 8c3213a5-c570-4a8a-b337-e05de991eb9e, Microsoft.Azure.Documents.Common/2.14.0"}
```
as fatal.
That way we don't retry and get the less-helpful error.
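Conceptually, the fix amounts to something like the sketch below; the type and function names are made up for illustration, and the real classification logic lives in the actual change:

```go
package example

import "strings"

// Classification is a stand-in for however the reconciler distinguishes
// retryable errors from terminal ones; both names are illustrative.
type Classification int

const (
	Retry Classification = iota
	Fatal
)

// classifyCreateError sketches the idea behind the fix: when the initial
// create fails with this capacity-style ServiceUnavailable error, treat
// it as fatal rather than retrying, so the informative first error is
// the one that stays on the resource.
func classifyCreateError(code, message string) Classification {
	if code == "ServiceUnavailable" && strings.Contains(message, "currently experiencing high demand") {
		return Fatal
	}
	return Retry
}
```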
This results in:
```
NAME READY SEVERITY REASON MESSAGE
matthchr-db-acct False Error ServiceUnavailable Database account creation failed. Operation Id: 768740a2-a70f-4dd0-8c8e-e4fe75ed0bf1, Error : Message: {"code":"ServiceUnavailable","message":"Sorry, we are currently experiencing high demand in West US 3 region, and cannot fulfill your request at this time Thu, 04 Apr 2024 18:20:43 GMT. To request region access for your subscription, please follow this link https://aka.ms/cosmosdbquota for more details on how to create a region access request.\r\nActivityId: 901bb9c4-21e5-43ad-b4ab-17a24cdcee47, Microsoft.Azure.Documents.Common/2.14.0"}, Request URI: /serviceReservation, RequestStats: , SDK: Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0, Microsoft.Azure.Documents.Common/2.14.0: GET https://management.azure.com/subscriptions/4ef44fef-c51d-4d7c-a6ff-8635c02848b1/providers/Microsoft.DocumentDB/locations/westus3/operationsStatus/07c29785-548d-40d3-bb8d-a925ed207f11...
```
which I think is more helpful than what we were doing before. That change is in PR #3906.
@mehighlow if you know of other resources where you've seen similar behavior, please let us know; we may be able to improve their experiences as well.
@matthchr, you bet! Thanks for the improvement!
**Describe the current behavior**
When a resource creation fails, ASO exposes the last status Azure provides, which might not be informative.
E.g. CosmosDB provisioning failed due to capacity constraints.
This message provides clear instructions for re-creating the DatabaseAccount, while an earlier message in close proximity exposes a different reason for the failure.
**Describe the improvement**
Add an event category so that events occurring within a ~5-minute window can be grouped, and extend status messages to provide more detail.
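A rough sketch of the proposed time-window grouping (the `Event` type, grouping key, and window value are illustrative assumptions, not existing ASO behavior):

```go
package example

import "time"

// Event is a minimal stand-in for a recorded status event.
type Event struct {
	Reason  string
	Message string
	Time    time.Time
}

// groupByWindow sketches the proposed grouping: consecutive events that
// share a reason and fall within `window` of the previous event in the
// group are bucketed together. Events are assumed to be sorted by time.
func groupByWindow(events []Event, window time.Duration) [][]Event {
	var groups [][]Event
	for _, e := range events {
		if n := len(groups); n > 0 {
			last := groups[n-1]
			if last[0].Reason == e.Reason && e.Time.Sub(last[len(last)-1].Time) <= window {
				groups[n-1] = append(last, e)
				continue
			}
		}
		groups = append(groups, []Event{e})
	}
	return groups
}
```

Called with `window = 5*time.Minute`, repeated copies of the CosmosDB error above would collapse into one group while unrelated events stay separate.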
**Additional context**
I've encountered these issues not only with CosmosDB but also with other Azure resource types.