Frequent 449 Errors When Creating or Updating Records in Cosmos DB

We are experiencing a significant increase in 449 errors, despite not sending concurrent requests at the time of these errors. We are currently using SDK version 3.32.0.

The documentation states that we need to retry the operation when encountering HTTP 449 errors. However, we cannot find any options in the SDK to customize the retry mechanism. The SDK's default retry seems to take more than 30 seconds to surface the HTTP 449 error at the application level. Our load balancer cuts connections off after 30 seconds, so we cannot even retry manually ourselves.

We have a few questions:

How can we implement a random back-off for retries (as suggested in the documentation) when we cannot customize the retry options in the SDK?
Are there any known issues on your end that could be contributing to this increase in 449 errors?
Would upgrading to the latest version of the SDK potentially resolve this problem? If so, could you please provide the relevant release notes for our reference?

Thank you for your assistance in resolving this issue.

Are you using the "Point-In-Time Restore" backup capability by any chance?

We have observed the same behavior with our Cosmos database (SQL) and were convinced the 449s (RetryWith) were not a result of incorrect application logic (e.g. concurrent writes), especially because the SDK would continuously retry on its own until the timeout was hit (in our case 1 minute). I believe this is the intended behavior of the SDK in case of 449s, but we doubted that the 449 responses themselves were correct.

So we opened a ticket with Azure support to assist us in this matter. After some back and forth, yesterday we received an update with the root cause of the issue. They mentioned they found an issue with the "Point-In-Time Restore" backup capability. In short, this is what we learned:

It seems the replication to the backup store happens synchronously instead of asynchronously;
If the amount of data to be pushed to the backup store is beyond a specific limit, the backup requests are throttled;
Bug: In some cases the system would incorrectly determine that the pending data is beyond that limit, causing requests to be throttled.

The combination of the above 3 points can cause the SDK to receive 449 (RetryWith) responses and, by design, keep retrying (until the timeout is hit).

They have mitigated our issue by enabling a configuration on our database which unblocks user requests when throttling to the backup store is being applied, so our requests no longer fail for this reason.

The good news is they mentioned a more permanent fix (making the replication to the backup store asynchronous) has already been developed and is currently being tested. I don't have an ETA for this fix to be applied to the CosmosDB global service.

Hopefully someone from the CosmosDB team is able to confirm this issue in case others come looking for it.

In the meantime, for your specific case, would it be an option to reduce the request timeout to a value below 30 seconds (e.g. 15 seconds)? If you catch this case in a custom RequestHandler (after 15 seconds, the response will have status code 449), you should be able to immediately initiate a retry by re-sending the request.
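To illustrate the idea of retrying with a random back-off while staying under the load balancer's cut-off, here is a minimal, SDK-agnostic sketch in Python. It is not the Cosmos DB SDK's API: `RetryWithError`, `retry_with_jitter`, and all parameter names are hypothetical, and the 25-second budget is only an assumed value chosen to stay below the 30-second limit mentioned above. The same pattern would need to be adapted to whatever hook your client exposes.

```python
import random
import time


class RetryWithError(Exception):
    """Hypothetical stand-in for a response with status code 449 (RetryWith)."""


def retry_with_jitter(operation, budget_seconds=25.0, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep,
                      clock=time.monotonic, rng=random.uniform):
    """Call operation() until it succeeds or the time budget would be exceeded.

    Between attempts, wait a random delay in [0, cap], where cap doubles
    with every failed attempt up to max_delay ("full jitter" back-off).
    """
    deadline = clock() + budget_seconds
    attempt = 0
    while True:
        try:
            return operation()
        except RetryWithError:
            attempt += 1
            # Bound the exponent so the multiplication cannot overflow.
            exponent = min(attempt - 1, 16)
            cap = min(max_delay, base_delay * (2 ** exponent))
            delay = rng(0.0, cap)
            # Give up (re-raise the 449) if waiting would exceed the budget,
            # so the caller still has time to respond before the cut-off.
            if clock() + delay >= deadline:
                raise
            sleep(delay)
```

The overall budget, rather than a fixed attempt count, is what keeps the total time below the connection cut-off; the random delay spreads retries out so they do not all hit the service at the same moment.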

Azure / azure-cosmos-dotnet-v3

Frequent 449 Errors When Creating or Updating Records in Cosmos DB #4657