NuGet / Insights

Gather insights about public NuGet.org package data
Apache License 2.0

Intermittent transient errors trigger retry mechanism of Azure SDK APIs and lead to Conflict/NotFound errors #77

Closed: rmt2021 closed this issue 1 year ago

rmt2021 commented 2 years ago

Version

18d694cc1f1bd72826e7ccbfead499beb4613116

Description

We found that transient network errors (e.g., a network timeout or packet drop) may happen during the execution of Azure SDK API calls, and the SDK's retry mechanism can then lead to Conflict/NotFound errors. We found several call sites where these errors may happen: two calls to AddEntity, one to DeleteEntity, and four to UpdateEntity.

How to reproduce

More precisely, if the operation has actually taken effect on the server but the response never reaches the client in time because of a transient network error, the retry logic in the Azure Storage SDK will send the same request again. Since the first operation already succeeded on the server, the second request fails with a Conflict or NotFound error: DeleteEntity returns 404, and AddEntity returns 409.
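
As a rough illustration (this is not code from this repository; it assumes Azure.Data.Tables, the local development storage connection string, and a hypothetical `widgets` table), the second, retried request observes something like this:

```csharp
// Sketch only: mimics what a transparent SDK retry sees when the first attempt
// succeeded server-side but its response was lost in transit.
using Azure;
using Azure.Data.Tables;

var table = new TableClient("UseDevelopmentStorage=true", "widgets"); // hypothetical table
await table.CreateIfNotExistsAsync();

var entity = new TableEntity("pk", "rk") { ["Name"] = "example" };

await table.AddEntityAsync(entity);     // attempt 1: succeeds, but imagine the response is lost
try
{
    await table.AddEntityAsync(entity); // the "retry": the entity already exists
}
catch (RequestFailedException ex) when (ex.Status == 409)
{
    // Conflict: the retry collides with the earlier, successful insert.
    // The report above describes the analogous NotFound (404) case for DeleteEntity.
}
```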

Discussion

A better practice could be to wrap these API calls in try-catch blocks and handle the RequestFailedException, as is already done for lease acquisition. We are willing to contribute this, but we are not sure how to do it in an elegant way. Is it possible to wrap each of these calls in a try-catch block, as sketched below?
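
Something along these lines is what we have in mind. The extension method names here are hypothetical, not existing NuGet/Insights helpers, and swallowing the error is only safe when the call site is idempotent:

```csharp
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;

public static class TableClientRetryExtensions
{
    // Hypothetical helper: treat 409 on insert as "already done by an earlier attempt".
    public static async Task AddEntityIgnoringConflictAsync<T>(this TableClient table, T entity)
        where T : class, ITableEntity
    {
        try
        {
            await table.AddEntityAsync(entity);
        }
        catch (RequestFailedException ex) when (ex.Status == 409)
        {
            // A previous attempt (e.g. an SDK retry after a lost response) may have
            // already inserted this entity. Only safe if the payloads are equivalent.
        }
    }

    // Hypothetical helper: treat 404 on delete as "already deleted by an earlier attempt".
    public static async Task DeleteEntityIgnoringNotFoundAsync(
        this TableClient table, string partitionKey, string rowKey)
    {
        try
        {
            await table.DeleteEntityAsync(partitionKey, rowKey);
        }
        catch (RequestFailedException ex) when (ex.Status == 404)
        {
            // A previous attempt may have already deleted the entity.
        }
    }
}
```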

joelverhagen commented 2 years ago

For Add (a.k.a. insert) flows, a conflict error is tricky to retry on because, in general, it's not clear that the existing entity and the entity to add are logically equivalent in that specific flow, which is what would allow the try-catch to safely swallow the exception (or retry). You'd need to be really careful that the code inside the try-catch doesn't modify the entity in memory without performing an update, or regenerate the entity in memory as part of the try-catch. This can be very tricky as code becomes more layered and abstracted. I don't have all of the "adds" in my head to know whether most adds can benefit from this optimization.
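
To make that concrete, here is a minimal sketch of what "being careful" would look like in an insert flow. `IsLogicallyEquivalent` is a hypothetical, flow-specific check, not something that exists in this repo:

```csharp
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;

public static class AddWithEquivalenceCheck
{
    public static async Task AddIfEquivalentAsync(TableClient table, TableEntity entity)
    {
        try
        {
            await table.AddEntityAsync(entity);
        }
        catch (RequestFailedException ex) when (ex.Status == 409)
        {
            // Read back what is actually stored and compare it to what we tried to insert.
            TableEntity existing = await table.GetEntityAsync<TableEntity>(
                entity.PartitionKey, entity.RowKey);

            if (!IsLogicallyEquivalent(existing, entity))
            {
                // The conflict is with a genuinely different entity; it is not safe
                // to pretend the insert succeeded.
                throw;
            }
        }
    }

    // Flow-specific by nature; this placeholder just compares one property.
    private static bool IsLogicallyEquivalent(TableEntity a, TableEntity b)
        => a.GetString("Value") == b.GetString("Value");
}
```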

Similarly, with Update and Delete, you want to be sure you don't clobber another thread/worker's update, and it's hard to know that if your previous attempt threw a transient error but actually succeeded server side (meaning the ETag changed)...
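
In code terms (a sketch, not how Insights handles every update today), avoiding the clobber means making the update conditional on the ETag you read earlier, and a "lost" but actually successful previous attempt then shows up as a 412:

```csharp
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;

public static class ConditionalUpdateExample
{
    public static async Task IncrementCountAsync(TableClient table, string pk, string rk)
    {
        TableEntity entity = await table.GetEntityAsync<TableEntity>(pk, rk);
        entity["Count"] = entity.GetInt32("Count").GetValueOrDefault() + 1;

        try
        {
            // Conditional on the ETag we read: if anyone (including our own earlier,
            // "failed but actually succeeded" attempt) modified the entity since, this fails.
            await table.UpdateEntityAsync(entity, entity.ETag, TableUpdateMode.Replace);
        }
        catch (RequestFailedException ex) when (ex.Status == 412)
        {
            // Precondition failed: the ETag changed. The safe response depends on the
            // high-level operation, not on a generic retry.
        }
    }
}
```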

What you're suggesting is a category of change that needs to be assessed at the high level (in code) and factored into the idempotency of that high-level operation.

Let's look at each of the cases you provided:

Certainly, we can increase the code complexity and save some round trips with Table storage and/or Queue storage, but in these two cases (the generic batch "SDK" and the light timer entity, which is typically hit on a recurring cadence anyway) I am not convinced it's safe or worth the added complexity. Perhaps I am missing your point. Could you describe the specific high-level operation (e.g. execution of driver X, or aggregation of driver Y, or execution of timer Z) that you're seeing negative impact on?

I think for these sorts of changes, it would help me to understand the answer to this question: does the user-facing action get blocked because of the transient error, or does it just get delayed? If it just gets delayed, by how much time? Seconds? Minutes? Hours?

A core design principle I have for all of the Insights code is to keep the code simple and use queue-based retries as much as possible. I do indeed implement retries in some places, but that's typically because the data pipeline gets into a blocked or heavily delayed state without the retry. I've certainly missed places that will get pinched under higher network congestion or unreliability. Maybe your network environment is different from what I normally run with, so you're seeing a bunch of problems that I've simply never experienced.
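
For reference, the queue-based retry pattern I mean is roughly the following (a sketch against Azure.Storage.Queues with a hypothetical queue and handler, not actual Insights code): a failed message is simply not deleted, so Queue storage makes it visible again after the visibility timeout and the work is retried.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

var queue = new QueueClient("UseDevelopmentStorage=true", "work"); // hypothetical queue
await queue.CreateIfNotExistsAsync();

QueueMessage[] messages = await queue.ReceiveMessagesAsync(
    maxMessages: 1, visibilityTimeout: TimeSpan.FromMinutes(5));

foreach (QueueMessage message in messages)
{
    try
    {
        await ProcessAsync(message.MessageText);
        await queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
    }
    catch (Exception)
    {
        // Deliberately do nothing: the message becomes visible again after the
        // visibility timeout, which is the retry. DequeueCount can be used to
        // dead-letter poison messages instead of retrying forever.
    }
}

static Task ProcessAsync(string body) => Task.CompletedTask; // hypothetical handler
```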