Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.
MIT License
5.26k stars 4.6k forks source link

[QUERY] How to handle etag mismatch (412 Precondition Failed) caused by client retry #39146

Closed joelverhagen closed 11 months ago

joelverhagen commented 11 months ago

Library name and version

Azure.Data.Tables 12.8.1

Query/Question

Consider this scenario:

  1. Azure.Data.Tables client updates an entity with an etag (If-Match: <etag>)
  2. The server times out or returns a retriable status code after the data operation is complete on the server side.
  3. Client retries the operation with the same etag, not knowing the server state has changed
  4. Client encounters HTTP 412 Precondition failed

(there are adjacent scenarios like adding an entity for the first time and retrying to get a conflict)

What is the client to do to recover?

There are certainly solutions existing today, for example the client can retry the whole unit of work bounding the table storage operation (perhaps no-oping if the Table state actually went through or using a totally different entity depending on the implementation). However, sometimes the cost to retry the entire operation in the application is high enough that it's worth taking on more complexity to handle the 412 gracefully (maybe there is bunch of other API calls prior to the Azure Table API call and you don't want to repeat those in cases like this).

Simply detecting a retry in the Azure SDK HTTP pipeline is also not sufficient. Some retries will succeed (for example the request fails prior to the data operation persisting on the server side) and others will never succeed (for example another thread updates the entity, causing a legitimate Precondition Failed error).

One idea I wanted to try was performing a client-side diff. So if you hit an HTTP 412 (and optional only if and only if the client performed a retry as an optimization) you can fetch the entity from Table storage and compare the entity with what you were expecting to update. This could be done in a partial manner if you were hoping for a PATCH-like operation or for all properties if you were hoping to replace the entire entity state.

A Table entity is essentially a property bag of well-defined types so performing a client-side diff is pretty straight-forward. Unfortunately, the intermediate model (Dictionary<string, object> from ToOdataAnnotatedDictionary) that would be ideal for generic comparison is internal.

I can perhaps call the method with reflection or write my own serialization model used just for this comparison. But I wanted to raise this problem space to the Azure SDK team and see if there any prior art in this area or recommendations.

Here is a sample app that shows how client-driven retries can cause etag mismatches:

Sample C# console app ```csharp using System.Net; using Azure; using Azure.Core.Pipeline; using Azure.Data.Tables; var messageHandler = new TestHandler { InnerHandler = new SocketsHttpHandler() }; var serviceClient = new TableServiceClient( "UseDevelopmentStorage=true", new TableClientOptions { Transport = new HttpClientTransport(messageHandler) }); var table = serviceClient.GetTableClient("oopsie"); // setup Console.WriteLine("Table go!"); await table.CreateIfNotExistsAsync(); Console.WriteLine("Entity go!"); var entity = new TableEntity("pk", "rk"); entity["State"] = "Active"; entity["HitCount"] = 1; var response = await table.UpsertEntityAsync(entity, TableUpdateMode.Replace); entity.ETag = response.Headers.ETag!.Value; // repro Console.WriteLine("Update with If-Match go!"); entity["HitCount"] = 2; try { await table.UpdateEntityAsync(entity, entity.ETag, TableUpdateMode.Replace); } catch (RequestFailedException ex) when (ex.Status == 412) { Console.WriteLine($"Oopsie. {ex.Status} {ex.ErrorCode}"); // Get entity with client-side diff? } class TestHandler : DelegatingHandler { private int _requestCount; protected override async Task SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) { var response = await base.SendAsync(request, cancellationToken); if (request.Method == HttpMethod.Put && request.RequestUri!.AbsolutePath.EndsWith("(PartitionKey='pk',RowKey='rk')")) { if (Interlocked.Increment(ref _requestCount) == 2) { var statusCode = HttpStatusCode.ServiceUnavailable; Console.WriteLine($"Fake error! ({statusCode})"); return new HttpResponseMessage(statusCode) { RequestMessage = request }; } } return response; } } ```

The output looks like this:

Table go!
Entity go!
Update with If-Match go!
Fake error! (ServiceUnavailable)
Oopsie. 412 UpdateConditionNotSatisfied

Environment

.NET SDK:
 Version:   8.0.100-rc.1.23455.8
 Commit:    e14caf947f

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.22621
 OS Platform: Windows
 RID:         win-x64
 Base Path:   C:\Program Files\dotnet\sdk\8.0.100-rc.1.23455.8\

.NET workloads installed:
There are no installed workloads to display.

Host:
  Version:      8.0.0-rc.1.23419.4
  Architecture: x64
  Commit:       92959931a3
  RID:          win-x64

.NET SDKs installed:
  3.1.426 [C:\Program Files\dotnet\sdk]
  5.0.408 [C:\Program Files\dotnet\sdk]
  6.0.317 [C:\Program Files\dotnet\sdk]
  7.0.203 [C:\Program Files\dotnet\sdk]
  7.0.308 [C:\Program Files\dotnet\sdk]
  7.0.401 [C:\Program Files\dotnet\sdk]
  8.0.100-rc.1.23455.8 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 3.1.32 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 6.0.22 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 7.0.5 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 7.0.11 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 8.0.0-rc.1.23421.29 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 3.1.32 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.22 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 7.0.5 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 7.0.11 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 8.0.0-rc.1.23419.4 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.1.32 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 6.0.22 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 7.0.5 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 7.0.11 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 8.0.0-rc.1.23420.5 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

Other architectures found:
  x86   [C:\Program Files (x86)\dotnet]
    registered at [HKLM\SOFTWARE\dotnet\Setup\InstalledVersions\x86\InstallLocation]

Environment variables:
  Not set

global.json file:
  Not found

Learn more:
  https://aka.ms/dotnet/info

Download .NET:
  https://aka.ms/dotnet/download
jsquire commented 11 months ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

christothes commented 11 months ago

Hi @joelverhagen - Just to clarify - are you wanting to perform a GET operation to perform local validation outside the scope of the failed operation (as in your example code in the catch block) or as part of the pipeline process before the entity is transformed from the wire format to the serialized model?

For example, if as part of the Retry process, it may look like:

If what you want to do is outside the scope of the pipeline processing of the request, it seems like that would be possible without having to deal with the internal model transformation, assuming you are handling the failure of just the table operation, as in your example code. Is your concern that doing this would be too expensive?

The other concern I'd have with any sort of local diff approach is that depending on the nature of the update, there is no guarantee that a local comparison would reflect the intended outcome of the current state of the entity on the service. Perhaps the example above is contrived, but if you were updating a HitCount property to increment it, validating that the value is the same as what you intended to set might mean that some competing consumer already performend an update to reflect the 2nd hit and your update should now be setting the value to whatever the Get returned +1. If the update is intended to validate that just the other properties haven't changed, but HitCount shoudl always be set to the original value, you'd have to perform another Update operation anyway. If your update actually is intended to be absolute, that would be equivalent to an unconditional Update.

github-actions[bot] commented 11 months ago

Hi @joelverhagen. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

joelverhagen commented 11 months ago

Thanks @christothes for entertaining my edge case.

Just to clarify - are you wanting to perform a GET operation to perform local validation outside the scope of the failed operation (as in your example code in the catch block) or as part of the pipeline process before the entity is transformed from the wire format to the serialized model?

I think I would like the first idea as step 1, e.g. a catch block of a failed request (409 for inserting, 412 for updating). This would allow an application dev to surgically handle conflicts on a case-by-case basis, perhaps based on risk/reward of the extra complexity (and extra round trip).

The second idea sounds like something I would try if the number of cases got too many and it felt mechanical at the GET -> diff side, i.e. a lot of code duplication. This second idea has the benefit of limiting additional allocations. You don't need to reserialized the domain model to the wire format a second time in the catch block (knowing it was just in the wire format a moment ago in the pipeline).

For example, if as part of the Retry process, it may look like:

Yes, I was imagining something like that as a generic solution across all domain models.

Is your concern that doing this would be too expensive?

I am not so concerned about the compute/memory cost (although it does bother me 😅). I'm more thinking that the True representation of the entity is the wire format and not the domain model, at least for the moment in time when a fraudulent 412/409 occurs.

Consider a case where the diff is done the domain model and then later a new property is added. The wire format (and Azure SDK) will pick up the change automatically via reflection but the hand-rolled diff in the catch would need to react. This could be a source of bugs. So, I would rather perform a diff on the wire format. Granted, a change in domain model might need to change the diff logic in a domain-specific way.

If the update is intended to validate that just the other properties haven't changed, but HitCount shoudl always be set to the original value, you'd have to perform another Update operation anyway.

Yes, I take your point. I think this boils down to a CQRS concept. How is the "command" defined? Is it HitCount++ or is it HitCount = 1 and how should the commands be serialized when multiple threads are at play? From a generic Azure SDK perspective, I don't see how the pipeline could ever guess since Table Storage semantics are lower level than that (add new entity, replace entity fully, merge entity in a specific Table Storage defined way).

If somehow Azure Table Storage persisted a change token or client generated request ID then the GET could be disambiguated. The subsequent GET would have proof that the latest entity state was indeed initiated by that same thread and the 409/412 could be swallowed.

For my application I have an AttemptCount integer and NextAttempt timestamp (set to a client UTC now + some window). So it's not so contrived. In my case, I can check both the AttemptCount and NextAttempt timestamp to see if they are then values I intended (source code). It's highly unlikely for another thread to get the same NextAttempt timestamp so it operates like a hacky request ID concept I mentioned above. So this is closer to the HitCount = 1 command semantics since I don't want to allow another thread to increment it for me. I use If-Match: <etag> as a concurrency control and a way to (best effort) weed out parallel threads caused by message duplication (also quite possible from Azure.Storage.Queues for the same reason as here -- client retries 😄).

So, I think I could add a "change ID" column to all of my entities and handle the 409/412 with a custom pipeline like you described, i.e. "if this thread performed a retry, and a 409/412 was hit, do a GET and compare the change ID column. Swallow the 409/412 if the change token matches".


For the sake of a general Azure SDK question, I think it would be great to have access to the wire model or have the capability to hook into the pipeline with a predicate like this:

public delegate Task<bool> ShouldSwallowFailedRequestAsync(WireFormat model, RequestFailedException ex, bool hasRetry);

And you could call Task<WireFormat> GetEntityInWireFormatAsync(string pk, string, rk, ....) in the predicate and then do a relatively generic and safe diff. This API shape is a straw man (and it kind of sucks), I guess the point is a) knowing a retry occurred and b) having access to wire format for generic diff logic. Whether or not this happens inside or outside of the pipeline is just a performance consideration maybe?

christothes commented 11 months ago

Thanks @joelverhagen for the additional context!

Consider a case where the diff is done the domain model and then later a new property is added. The wire format (and Azure SDK) will pick up the change automatically via reflection but the hand-rolled diff in the catch would need to react. This could be a source of bugs. So, I would rather perform a diff on the wire format. Granted, a change in domain model might need to change the diff logic in a domain-specific way.

I think there are bug trade-offs either way. But if your comparison was targeted for the scenario rather than a complete entity comparison, a model comparison seems safe here.

Something else to consider is that the wire format is actually not internal, it's just a Dictionary<string, object>. It's the transformation to/from the wire format that is internal. If your intent is just to compare (and possibly edit) discrete properties, the dictionary format should be all that you'd need here.

But all that said, the only place you'd be able to evaluate and manipulate the request and response would be in a pipeline policy. I think a policy-based solution would be much more complex for limited additional benefit. However, if you decide this is the path you think makes the most sense for your scenario, I'd be happy to look at any implementation ideas you come up with. The easiest path would probably be to supply a custom RetryPolicy and put the logic there.

github-actions[bot] commented 11 months ago

Hi @joelverhagen. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

joelverhagen commented 11 months ago

I think there are bug trade-offs either way. But if your comparison was targeted for the scenario rather than a complete entity comparison, a model comparison seems safe here.

I agree with that.

If your intent is just to compare (and possibly edit) discrete properties, the dictionary format should be all that you'd need here.

Yes, that would be ideal. Unfortunately, I don't see an easy way to get access to the dictionary. I would like to call the ToOdataAnnotatedDictionary extension method but it is internal. https://github.com/Azure/azure-sdk-for-net/blob/aa5fe51801498a3d967b85caaeb3bd53c4d3a2ad/sdk/tables/Azure.Data.Tables/src/Extensions/TableEntityExtensions.cs#L16

Additionally the binder used inside the method is internal: https://github.com/Azure/azure-sdk-for-net/blob/aa5fe51801498a3d967b85caaeb3bd53c4d3a2ad/sdk/tables/Azure.Data.Tables/src/Extensions/TablesTypeBinder.cs#L11

Would it be possible to make this public or is there another way to get the dictionary that I am missing?

But all that said, the only place you'd be able to evaluate and manipulate the request and response would be in a pipeline policy. I think a policy-based solution would be much more complex for limited additional benefit.

Is it possible to interrogate the Response or some other state to determine if a retry did occur?

Maybe I can do a pass-through/no-op custom policy that simply allows the top-level call site (e.g. TableClient.GetEntityAsync<T>(...)) to somehow know if a retry happened? I'm not sure how it's stash that kind of information on the Response<T> unlike vanilla HttpClient (and it's pipeline) where you can stuff these kind of properties on HttpResponseMessage.RequestMessage.Options.

christothes commented 11 months ago

Unfortunately, I don't see an easy way to get access to the dictionary.

Here is an example of accessing the raw dictionary response:

var response = client.GetEntity<UserEntity>(userEntity.PartitionKey, userEntity.RowKey, null, default);
var dict = response.GetRawResponse().Content.ToObjectFromJson<Dictionary<string, object>>();
foreach (var kvp in dict)
{
    Console.WriteLine($"{kvp.Key}: {kvp.Value}");
/*
prints:
odata.metadata: https://chrissscratch.table.core.windows.net/$metadata#patchicm/@Element
odata.etag: W/"datetime'2023-10-13T15%3A02%3A07.4628778Z'"
PartitionKey: 23ee47a8-5120-49fb-9064-4e12c60fa939
RowKey: 40419cd3-e3c0-4c32-992f-8355632999ab
Timestamp: 2023-10-13T15:02:07.4628778Z
Value: 476692844
*/
}
github-actions[bot] commented 11 months ago

Hi @joelverhagen. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

joelverhagen commented 11 months ago

Thanks @christothes.

However, if you decide this is the path you think makes the most sense for your scenario, I'd be happy to look at any implementation ideas you come up with.

I will take you up on this! I ran into pain points that it would be nice to improve on but it's not critical.

  1. A generalized solution would need access to the Dictionary<string, object> for both the request and the response. Unfortunately, the Response object doesn't have a pointer to the Request object so I have to "guess" that JsonSerializer can produce the right dictionary. ToObjectFromJson seems to operate using the default JsonSerializer so I guess that can work. I switched to Dictionary<string, JsonElement> since that saves a cast.
  2. It's not clear how to plumb information out of a custom RetryPolicy. I tried to add to the property bag with HttpMessage.SetProperty but I can't see how to get the HttpMessage instance outside of the pipeline. I also tried adding a faked request header, but the request isn't available outside of the pipeline (related to the previous problem -- if it was available I could deserialize the request body too). The response headers appear to be read-only so I can't stash a value there either. So I tried with an AzureEventSourceListener but this is different from your suggestion, so I think I missed the point.

Here is the solution I came up with (with a diff from the repro I originally included). https://gist.github.com/joelverhagen/bbf0bdd91cfcdb5784abf135a859a108/revisions?diff=unified#diff-0b69b473fe937040615d69f606751f61ddbc2e3a1849360ff2456c22afe88c0b

christothes commented 11 months ago

What if you do the work of HasEntityChangedAsync inside your custom implementation of RetryPolicy?

github-actions[bot] commented 11 months ago

Hi @joelverhagen. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

joelverhagen commented 11 months ago

I'm not sure why I didn't think of that! Thanks! And yes, that seems works also. Here's my attempt: https://gist.github.com/joelverhagen/bbf0bdd91cfcdb5784abf135a859a108/revisions#diff-0b69b473fe937040615d69f606751f61ddbc2e3a1849360ff2456c22afe88c0b

Challenges of the design:

  1. You lose call site context so you have to sniff it from the HttpMessage.Request.
    • You have to extract table name, PK, and RK from the HttpMessage.Request to fetch the remote state. It's kind of gnarly.
    • You'd need to sniff out whether 412 or 409 is the conflict case (i.e. was it a add entity or an upsert/upsert entity). This is probably based on the HttpMessage.Request If-* headers... complex
  2. You need to override the response to be the expected status code. This is a bit painful since there don't appear to be any in-memory Response implementations as public types. But it's a simple class to implement especially for 204 No Content.
  3. You need another nested service client to fetch the remote state. That's a big painful too.

So the retry occurring inside or outside of the pipeline seems to be a trade-off between which contextual information you want available.

I think this is practical but nasty/complex enough that it should only be done if it's really important to no-op on these failures in a best effort -- which I guess is a decision based solely the dev's requirements. So of course, I want to implement it 😈

christothes commented 11 months ago

You have to extract table name, PK, and RK from the HttpMessage.Request to fetch the remote state. It's kind of gnarly.

For methods like Update, at least, you can get the RK and PK from the URI

You'd need to sniff out whether 412 or 409 is the conflict case (i.e. was it a add entity or an upsert/upsert entity). This is probably based on the HttpMessage.Request If-* headers... complex

This could probably be optimized to be based on the HTTP verb for Update or Add

  1. You lose call site context so you have to sniff it from the HttpMessage.Request.

But another possible approach to retaining the call site context is to utilize CreateClientRequestIdScope. In theory, you could create some kind of mapping between your custom requestId value and the call site details.

joelverhagen commented 11 months ago

This is great. HttpPipeline.CreateHttpMessagePropertiesScope is even better (never knew of this before). I can plumb in everything I need.

Leaving this POC for posterity: https://gist.github.com/joelverhagen/bbf0bdd91cfcdb5784abf135a859a108

The problem gets nasty with $batch so I think I will use a ChangeId = <guid> column which can be used to check on a single property (of a single entity, in the case of $batch). That's probably what I'll do in my application. That concept translates better to blobs as well, where you won't want to do a full byte-for-byte comparison of the blob contents (where it's more feasible for column-for-column comparison of a Table entity). I can compare a x-ms-meta-changeid: <guid> blob metadata there.

It would still be nice to have ToOdataAnnotatedDictionary as a public extension method for more flexibility in implementation, e.g. it's very painful to get the property bag from a $batch request body, but I see there are other ways to avoid that blocker.

Thanks for your help. I'll close this since it appears quite feasible with HttpPipeline.CreateHttpMessagePropertiesScope + a custom RetryPolicy.