Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API

[Per Partition Automatic Failover] Failover Partition on Next region when RG Fails with `HttpRequestException` #4564

Open kundadebdatta opened 4 days ago

kundadebdatta commented 4 days ago

Description

Background:

During one of our recent backend DR drills, we found that when the primary region's routing gateway was experiencing an outage, and the server and master partitions had been taken down, the .NET v3 SDK kept retrying on the same region for 20 minutes, until the primary write region came back up. This issue has been identified as intermittent and is referred to as the "hanging cosmos client".

Account Setup For 3 Regions: Create a Cosmos account (single master) with 3 regions: WestUs2 (Write), EastUs2 (Read) and NorthCentralUS (Read). The PPAF configuration from the BE is to fail over to EastUs2 in case WestUs2 is unavailable.

Scenario: While creating the Cosmos client, provide WestUs2, EastUs2 and NorthCentralUS as the application preferred regions.

```csharp
CosmosClientOptions clientOptions = new CosmosClientOptions()
{
    ApplicationPreferredRegions = new List<string>()
    {
        Regions.WestUS2,
        Regions.EastUS2,
        Regions.NorthCentralUS
    },
    EnablePartitionLevelFailover = true,
};
```
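
For context, a minimal sketch of how a client with these options might be constructed and used to issue the write exercised in this scenario; the endpoint, key, and partition key below are placeholders, while the database and container ids (db/ct) come from the diagnostics further down.

```csharp
using System;
using Microsoft.Azure.Cosmos;

// Placeholder endpoint and key; assumes the clientOptions shown above.
CosmosClient client = new CosmosClient(
    "https://<account>.documents.azure.com:443/",
    "<key>",
    clientOptions);

// Database and container ids (db/ct) come from the diagnostics below; the
// partition key property and value are placeholders.
Container container = client.GetContainer("db", "ct");

// The write that kept retrying on WestUs2 (the single write region) during the outage.
ItemResponse<dynamic> response = await container.CreateItemAsync<dynamic>(
    new { id = Guid.NewGuid().ToString(), pk = "pk1" },
    new PartitionKey("pk1"));
```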

There were 2 different accounts (each with 100 partitions), and 40 different clients were set up in the same region (West US 2) per account, for a total of 80 Cosmos clients.

There is only 1 cluster involved in each region, i.e. all 200 partitions for both accounts are on westus2-be2 (and the same holds for EastUs2 and NorthCentralUS).

After initializing and warming up the Cosmos clients, the cluster was brought down (all Server and Master nodes were stopped) at 2024-05-09 20:30 for 8 minutes. This also brought down the Routing Gateway service in the West US 2 region.

Current Behavior: The SDK keeps retrying the WestUs2 region to write the document (since WestUs2 is the single write region), and after retrying for 25 minutes the attempt finally succeeds once the region comes back up. In other words, the request was "hung" for 25 minutes. To understand this better, take a look at the diagnostics below:

Diagnostics Snippet - Scenario: Routing Gateway, along with Server and Master partitions are completely down. ```json { "timestamp": "2024-04-24 22:16:23.8883769", "clientId": 0, "attempts": 404, "successes": 396, "status": 201, "substatus": 0, "contactedRegions": [ "West US 2", "North Central US", "East US 2" ], "diagnostics": { "Summary": { "DirectCalls": { "(503, 20006)": 1, "(403, 3)": 19, "(410, 20001)": 99, "(201, 0)": 1, "(204, 0)": 1 }, "GatewayCalls": { "(200, 0)": 2, "(0, 0)": 9 } }, "name": "CreateItemAsync", "start datetime": "2024-04-24T22:16:23.888Z", "duration in milliseconds": 2228617.5198, "data": { "Client Configuration": { "Client Created Time Utc": "2024-04-24T22:09:36.1087494Z", "MachineId": "vmId:e1b9038a-a228-4212-8975-58c50fa37ff4", "VM Region": "eastus", "NumberOfClientsCreated": 1, "NumberOfActiveClients": 1, "ConnectionMode": "Direct", "User Agent": "cosmos-netstandard-sdk/24.14.2|1|X64|Microsoft Windows 10.0.22631|.NET 6.0.29|L|", "ConnectionConfig": { "gw": "(cps:50, urto:6, p:False, httpf: False)", "rntbd": "(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)", "other": "(ed:False, be:False)" }, "ConsistencyConfig": "(consistency: NotSet, prgns:[West US 2, East US 2, North Central US], apprgn: )", "ProcessorCount": 16 } }, "children": [ { "name": "ItemSerialize", "duration in milliseconds": 0.059 }, { "name": "Get PkValue From Stream", "duration in milliseconds": 0.0772, "children": [ { "name": "Get Collection Cache", "duration in milliseconds": 0.0015 } ] }, { "name": "Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler", "duration in milliseconds": 2228617.2795, "children": [ { "name": "Get Collection Cache", "duration in milliseconds": 0.0004 }, { "name": "Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler", "duration in milliseconds": 2228617.2431, "data": { "System Info": { "systemHistory": [ { "dateUtc": "2024-04-24T22:52:39.9986802Z", "cpu": 63.066, "memory": 27080876.000, "threadInfo": { "isThreadStarving": "False", "threadWaitIntervalInMs": 0.051, "availableThreads": 32731, "minThreads": 16, "maxThreads": 32767 }, "numberOfOpenTcpConnection": 344 }, { "dateUtc": "2024-04-24T22:52:49.9991262Z", "cpu": 34.424, "memory": 28758680.000, "threadInfo": { "isThreadStarving": "True", "threadWaitIntervalInMs": 10000.4658, "availableThreads": 32749, "minThreads": 16, "maxThreads": 32767 }, "numberOfOpenTcpConnection": 428 }, { "dateUtc": "2024-04-24T22:53:00.0147595Z", "cpu": 26.365, "memory": 32764952.000, "threadInfo": { "isThreadStarving": "True", "threadWaitIntervalInMs": 20016.1211, "availableThreads": 32728, "minThreads": 16, "maxThreads": 32767 }, "numberOfOpenTcpConnection": 480 }, { "dateUtc": "2024-04-24T22:53:10.0308936Z", "cpu": 52.937, "memory": 32207684.000, "threadInfo": { "isThreadStarving": "True", "threadWaitIntervalInMs": 30032.2772, "availableThreads": 32746, "minThreads": 16, "maxThreads": 32767 }, "numberOfOpenTcpConnection": 599 }, { "dateUtc": "2024-04-24T22:53:20.0538398Z", "cpu": 49.795, "memory": 36020124.000, "threadInfo": { "isThreadStarving": "True", "threadWaitIntervalInMs": 40055.2454, "availableThreads": 32747, "minThreads": 16, "maxThreads": 32767 }, "numberOfOpenTcpConnection": 625 }, { "dateUtc": "2024-04-24T22:53:30.0784956Z", "cpu": 47.215, "memory": 36624688.000, "threadInfo": { "isThreadStarving": "True", "threadWaitIntervalInMs": 50079.9233, "availableThreads": 32746, "minThreads": 16, "maxThreads": 32767 }, "numberOfOpenTcpConnection": 662 } ] } }, "children": [ { "name": 
"Microsoft.Azure.Cosmos.Handlers.TelemetryHandler", "duration in milliseconds": 2228617.2296, "children": [ { "name": "Microsoft.Azure.Cosmos.Handlers.RetryHandler", "duration in milliseconds": 2228617.2261, "children": [ { "name": "Microsoft.Azure.Cosmos.Handlers.RouterHandler", "duration in milliseconds": 5151.6177, "children": [ { "name": "Microsoft.Azure.Cosmos.Handlers.TransportHandler", "duration in milliseconds": 5151.6152, "children": [ { "name": "Microsoft.Azure.Documents.ServerStoreModel Transport Request", "duration in milliseconds": 5150.6828, "data": { "Client Side Request Stats": { "Id": "AggregatedClientSideRequestStatistics", "ContactedReplicas": [ { "Count": 1, "Uri": "rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/apps/08a9385f-af01-40f2-bd90-10d4f17133a4/services/80c902b9-1d8f-48b2-a0fc-11a9e5bf8eed/partitions/f1f7ae05-77cd-4f52-b733-955cfb301b70/replicas/133554390203135475p/" }, { "Count": 1, "Uri": "rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14008/apps/08a9385f-af01-40f2-bd90-10d4f17133a4/services/80c902b9-1d8f-48b2-a0fc-11a9e5bf8eed/partitions/f1f7ae05-77cd-4f52-b733-955cfb301b70/replicas/133554390203135477s/" }, { "Count": 1, "Uri": "rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14059/apps/08a9385f-af01-40f2-bd90-10d4f17133a4/services/80c902b9-1d8f-48b2-a0fc-11a9e5bf8eed/partitions/f1f7ae05-77cd-4f52-b733-955cfb301b70/replicas/133554390203135476s/" }, { "Count": 1, "Uri": "rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14364/apps/08a9385f-af01-40f2-bd90-10d4f17133a4/services/80c902b9-1d8f-48b2-a0fc-11a9e5bf8eed/partitions/f1f7ae05-77cd-4f52-b733-955cfb301b70/replicas/133574122218444102s/" } ], "RegionsContacted": [], "FailedReplicas": [], "AddressResolutionStatistics": [], "StoreResponseStatistics": [ { "ResponseTimeUTC": "2024-04-24T22:16:29.0389674Z", "ResourceType": "Document", "OperationType": "Create", "LocationEndpoint": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net/", "StoreResult": { "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559", "StatusCode": "ServiceUnavailable", "SubStatusCode": "Channel_Closed", "LSN": -1, "PartitionKeyRangeId": null, "GlobalCommittedLSN": -1, "ItemLSN": -1, "UsingLocalLSN": false, "QuorumAckedLSN": -1, "SessionToken": null, "CurrentWriteQuorum": -1, "CurrentReplicaSetSize": -1, "NumberOfReadRegions": -1, "IsValid": false, "StorePhysicalAddress": "rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/apps/08a9385f-af01-40f2-bd90-10d4f17133a4/services/80c902b9-1d8f-48b2-a0fc-11a9e5bf8eed/partitions/f1f7ae05-77cd-4f52-b733-955cfb301b70/replicas/133554390203135475p/", "RequestCharge": 0, "RetryAfterInMs": null, "BELatencyInMs": null, "ReplicaHealthStatuses": [ "(port: 14351 | status: Connected | lkt: 4/24/2024 10:09:38 PM)" ], "transportRequestTimeline": { "requestTimeline": [ { "event": "Created", "startTimeUtc": "2024-04-24T22:16:23.8887179Z", "durationInMs": 0.0035 }, { "event": "ChannelAcquisitionStarted", "startTimeUtc": "2024-04-24T22:16:23.8887214Z", "durationInMs": 0.0152 }, { "event": "Pipelined", "startTimeUtc": "2024-04-24T22:16:23.8887366Z", "durationInMs": 0.1556 }, { "event": "Transit Time", "startTimeUtc": "2024-04-24T22:16:23.8888922Z", "durationInMs": 5147.9765 }, { "event": "Failed", "startTimeUtc": "2024-04-24T22:16:29.0368687Z", "durationInMs": 0 } ], "serviceEndpointStats": { "inflightRequests": 1, "openConnections": 1 }, "connectionStats": { "waitforConnectionInit": "False", 
"callsPendingReceive": 0, "lastSendAttempt": "2024-04-24T22:15:48.8207325Z", "lastSend": "2024-04-24T22:15:48.8208227Z", "lastReceive": "2024-04-24T22:15:48.8961258Z" }, "requestSizeInBytes": 988, "requestBodySizeInBytes": 553 }, "TransportException": "A client transport error occurred: The connection failed. (Time: 2024-04-24T22:16:29.0366808Z, activity ID: c3f0c0fa-f198-4eea-95c8-adbc378b8559, error code: ConnectionBroken [0x0012], base error: socket error ConnectionReset [0x00002746], URI: rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/apps/08a9385f-af01-40f2-bd90-10d4f17133a4/services/80c902b9-1d8f-48b2-a0fc-11a9e5bf8eed/partitions/f1f7ae05-77cd-4f52-b733-955cfb301b70/replicas/133554390203135475p/, connection: 10.0.0.99:54893 -> 40.64.135.3:14351, payload sent: True)" } } ] }, "Point Operation Statistics": { "Id": "PointOperationStatistics", "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559", "ResponseTimeUtc": "2024-04-24T22:16:29.0396719Z", "StatusCode": 503, "SubStatusCode": 20006, "RequestCharge": 0, "RequestUri": "dbs/db/colls/ct", "ErrorMessage": "Microsoft.Azure.Documents.ServiceUnavailableException: Channel is closed\r\nActivityId: c3f0c0fa-f198-4eea-95c8-adbc378b8559, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Windows/10.0.22631 cosmos-netstandard-sdk/3.32.1\r\n ---> Microsoft.Azure.Documents.TransportException: A client transport error occurred: The connection failed. (Time: 2024-04-24T22:16:29.0366808Z, activity ID: c3f0c0fa-f198-4eea-95c8-adbc378b8559, error code: ConnectionBroken [0x0012], base error: socket error ConnectionReset [0x00002746], URI: rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/apps/08a9385f-af01-40f2-bd90-10d4f17133a4/services/80c902b9-1d8f-48b2-a0fc-11a9e5bf8eed/partitions/f1f7ae05-77cd-4f52-b733-955cfb301b70/replicas/133554390203135475p/, connection: 10.0.0.99:54893 -> 40.64.135.3:14351, payload sent: True)\r\n ---> Microsoft.Azure.Documents.TransportException: A client transport error occurred: Failed to read the server response. 
(Time: 2024-04-24T22:16:29.0350828Z, activity ID: 00000000-0000-0000-0000-000000000000, error code: ReceiveFailed [0x000F], base error: socket error ConnectionReset [0x00002746], URI: rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/, connection: 10.0.0.99:54893 -> 40.64.135.3:14351, payload sent: True)\r\n ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host..\r\n ---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host.\r\n --- End of inner exception stack trace ---\r\n at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)\r\n at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)\r\n at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](TIOAdapter adapter, Memory`1 buffer)\r\n at Microsoft.Azure.Cosmos.Rntbd.RntbdStreamReader.ReadStreamAsync(Byte[] buffer, Int32 offset, Int32 count)\r\n at Microsoft.Azure.Cosmos.Rntbd.RntbdStreamReader.PopulateBytesAndReadAsync(Byte[] payload, Int32 offset, Int32 count)\r\n at Microsoft.Azure.Documents.Rntbd.Connection.ReadPayloadAsync(Byte[] payload, Int32 length, String type, ChannelCommonArguments args)\r\n --- End of inner exception stack trace ---\r\n at Microsoft.Azure.Documents.Rntbd.Connection.TraceAndThrowReceiveFailedException(IOException e, String type, ChannelCommonArguments args)\r\n at Microsoft.Azure.Documents.Rntbd.Connection.ReadPayloadAsync(Byte[] payload, Int32 length, String type, ChannelCommonArguments args)\r\n at Microsoft.Azure.Documents.Rntbd.Connection.ReadResponseMetadataAsync(ChannelCommonArguments args)\r\n at Microsoft.Azure.Documents.Rntbd.Dispatcher.ReceiveLoopAsync()\r\n --- End of inner exception stack trace ---\r\n at Microsoft.Azure.Documents.Rntbd.Dispatcher.CallAsync(ChannelCallArguments args, TransportRequestStats transportRequestStats)\r\n at Microsoft.Azure.Documents.Rntbd.Channel.RequestAsync(DocumentServiceRequest request, TransportAddressUri physicalAddress, ResourceOperation resourceOperation, Guid activityId, TransportRequestStats transportRequestStats)\r\n at Microsoft.Azure.Documents.Rntbd.LoadBalancingPartition.RequestAsync(DocumentServiceRequest request, TransportAddressUri physicalAddress, ResourceOperation resourceOperation, Guid activityId, TransportRequestStats transportRequestStats)\r\n at Microsoft.Azure.Documents.Rntbd.TransportClient.InvokeStoreAsync(TransportAddressUri physicalAddress, ResourceOperation resourceOperation, DocumentServiceRequest request)\r\n --- End of inner exception stack trace ---\r\n at Microsoft.Azure.Documents.Rntbd.TransportClient.InvokeStoreAsync(TransportAddressUri physicalAddress, ResourceOperation resourceOperation, DocumentServiceRequest request)\r\n at Microsoft.Azure.Documents.ConsistencyWriter.WritePrivateAsync(DocumentServiceRequest request, TimeoutHelper timeout, Boolean forceRefresh)\r\n at Microsoft.Azure.Documents.StoreResult.VerifyCanContinueOnException(DocumentClientException ex)\r\n at Microsoft.Azure.Documents.ConsistencyWriter.WritePrivateAsync(DocumentServiceRequest request, TimeoutHelper timeout, Boolean forceRefresh)\r\n at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync[TParam,TPolicy](Func`1 callbackMethod, Func`3 callbackMethodWithParam, Func`2 callbackMethodWithPolicy, TParam param, IRetryPolicy retryPolicy, 
IRetryPolicy`1 retryPolicyWithArg, Func`1 inBackoffAlternateCallbackMethod, Func`2 inBackoffAlternateCallbackMethodWithPolicy, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)\r\n at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)\r\n at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync[TParam,TPolicy](Func`1 callbackMethod, Func`3 callbackMethodWithParam, Func`2 callbackMethodWithPolicy, TParam param, IRetryPolicy retryPolicy, IRetryPolicy`1 retryPolicyWithArg, Func`1 inBackoffAlternateCallbackMethod, Func`2 inBackoffAlternateCallbackMethodWithPolicy, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)\r\n at Microsoft.Azure.Documents.BackoffRetryUtility`1.ExecuteRetryAsync[TParam,TPolicy](Func`1 callbackMethod, Func`3 callbackMethodWithParam, Func`2 callbackMethodWithPolicy, TParam param, IRetryPolicy retryPolicy, IRetryPolicy`1 retryPolicyWithArg, Func`1 inBackoffAlternateCallbackMethod, Func`2 inBackoffAlternateCallbackMethodWithPolicy, TimeSpan minBackoffForInBackoffCallback, CancellationToken cancellationToken, Action`1 preRetryCallback)\r\n at Microsoft.Azure.Documents.ConsistencyWriter.WriteAsync(DocumentServiceRequest entity, TimeoutHelper timeout, Boolean forceRefresh, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Documents.ReplicatedResourceClient.<>c__DisplayClass31_0.<b__0>d.MoveNext()\r\n--- End of stack trace from previous location ---\r\n at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)\r\n at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)\r\n at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)\r\n at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)\r\n at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy)\r\n at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken)\r\n at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)", "RequestSessionToken": null, "ResponseSessionToken": null, "BELatencyInMs": null } } } ] } ] }, .... ] }, "errorMessage": "" } ```

Why Did This Happen?

Note: This is applicable to write requests on a single-master account. When the complete cluster (the nodes that host all the server partitions plus the master partition) is down due to an outage, and the routing gateway is not reachable, then after some retries in the GoneAndRetryWithRequestRetryPolicy the write request finally fails with a TransportGenerated410.

Diagnostics Snippet - Transport Stack Failure. ```json { "ResponseTimeUTC": "2024-04-24T22:19:31.5361673Z", "ResourceType": "Document", "OperationType": "Create", "LocationEndpoint": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net/", "StoreResult": { "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559", "StatusCode": "Gone", "SubStatusCode": "TransportGenerated410", "LSN": -1, "PartitionKeyRangeId": null, "GlobalCommittedLSN": -1, "ItemLSN": -1, "UsingLocalLSN": false, "QuorumAckedLSN": -1, "SessionToken": null, "CurrentWriteQuorum": -1, "CurrentReplicaSetSize": -1, "NumberOfReadRegions": -1, "IsValid": false, "StorePhysicalAddress": "rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/apps/08a9385f-af01-40f2-bd90-10d4f17133a4/services/80c902b9-1d8f-48b2-a0fc-11a9e5bf8eed/partitions/f1f7ae05-77cd-4f52-b733-955cfb301b70/replicas/133554390203135475p/", "RequestCharge": 0, "RetryAfterInMs": null, "BELatencyInMs": null, "ReplicaHealthStatuses": [ "(port: 14351 | status: Connected | lkt: 4/24/2024 10:09:38 PM)" ], "transportRequestTimeline": { "requestTimeline": [ { "event": "Created", "startTimeUtc": "2024-04-24T22:19:26.9718530Z", "durationInMs": 0.002 }, { "event": "ChannelAcquisitionStarted", "startTimeUtc": "2024-04-24T22:19:26.9718550Z", "durationInMs": 4564.25 }, { "event": "Failed", "startTimeUtc": "2024-04-24T22:19:31.5361050Z", "durationInMs": 0 } ], "serviceEndpointStats": { "inflightRequests": 11, "openConnections": 1 }, "connectionStats": { "waitforConnectionInit": "True" } }, "TransportException": "A client transport error occurred: The connection attempt timed out. (Time: 2024-04-24T22:19:31.5358378Z, activity ID: f5c15e36-470d-44e0-9c41-61d2132b82af, error code: ConnectTimeout [0x0006], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/, connection: -> rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/, payload sent: False)" } } ```

This triggers a forced address refresh, which after 3 attempts fails with an HttpRequestException since the gateway service itself is not reachable.

Diagnostics Snippet - HTTP Response Stack. ```json "HttpResponseStats": [ { "StartTimeUTC": "2024-04-24T22:20:57.1125860Z", "DurationInMs": 505.6518, "RequestUri": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net//addresses/?$resolveFor=dbs%2fIkltAA%3d%3d%2fcolls%2fIkltALjPvaU%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=15", "ResourceType": "Document", "HttpMethod": "GET", "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559", "ExceptionType": "System.Threading.Tasks.TaskCanceledException", "ExceptionMessage": "A task was canceled." }, { "StartTimeUTC": "2024-04-24T22:20:57.6182567Z", "DurationInMs": 5009.6365, "RequestUri": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net//addresses/?$resolveFor=dbs%2fIkltAA%3d%3d%2fcolls%2fIkltALjPvaU%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=15", "ResourceType": "Document", "HttpMethod": "GET", "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559", "ExceptionType": "System.Threading.Tasks.TaskCanceledException", "ExceptionMessage": "A task was canceled." }, { "StartTimeUTC": "2024-04-24T22:21:03.6319840Z", "DurationInMs": 13527.8359, "RequestUri": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net//addresses/?$resolveFor=dbs%2fIkltAA%3d%3d%2fcolls%2fIkltALjPvaU%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=15", "ResourceType": "Document", "HttpMethod": "GET", "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559", "ExceptionType": "System.Net.Http.HttpRequestException", "ExceptionMessage": "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net:443)" } ] ```
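
As a rough illustration of the forced address-refresh behavior described above: a generic retry loop that gives up after 3 gateway attempts and surfaces the last failure. This is a hypothetical sketch, not the SDK's actual gateway address cache code; only the attempt count and exception types come from the description and the HttpResponseStats above.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical sketch of the forced address refresh: up to three gateway calls,
// after which the last failure (an HttpRequestException when the gateway is
// unreachable) bubbles up to the caller.
internal static class AddressRefreshSketch
{
    public static async Task<T> RefreshAddressesWithRetriesAsync<T>(Func<Task<T>> gatewayCall, int maxAttempts = 3)
    {
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return await gatewayCall();
            }
            catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
            {
                // In the drill, the first attempts were cancelled (TaskCanceledException)
                // and the final one failed to connect (HttpRequestException), matching
                // the HttpResponseStats above.
                lastFailure = ex;
            }
        }

        throw lastFailure;
    }
}
```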

The HttpRequestException bubbles up to the ClientRetryPolicy, which checks for it and triggers another retry. Now, here is where things get a little interesting. Since the account is single-master, there are no other write regions to fail over to, and the retry policy keeps retrying in the same write region, which is WestUS2. This is why we saw retries continuing indefinitely until the primary write region came back up.
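
To make that retry decision concrete, here is a simplified, hypothetical sketch of the behavior described above. None of the names below are the SDK's actual internals (the real logic lives in ClientRetryPolicy); this only illustrates why a single-master account ends up retrying the same region.

```csharp
// Hypothetical illustration only; invented names, not the SDK's ClientRetryPolicy.
internal enum RetryTarget
{
    SameWriteRegion,      // keep retrying the only write region
    NextPreferredRegion   // move on to the next region in the preferred list
}

internal static class RetryDecisionSketch
{
    public static RetryTarget DecideRetryOnHttpRequestException(bool isWriteRequest, int availableWriteRegions)
    {
        // Single-master account: there is no alternate write region to fail over to,
        // so today the policy keeps retrying the same (unavailable) write region until
        // it recovers -- the "hanging cosmos client" behavior observed in the drill.
        if (isWriteRequest && availableWriteRegions <= 1)
        {
            return RetryTarget.SameWriteRegion;
        }

        // Reads (and multi-write accounts) can simply move to the next preferred region.
        return RetryTarget.NextPreferredRegion;
    }
}
```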

Proposed Solution:

The purpose of the accompanying PR is to change the ClientRetryPolicy behavior to add a partition-level override when such an HttpRequestException happens and PPAF is enabled, so that the affected partition fails over to the next preferred region instead of retrying the unavailable write region indefinitely. This makes sure the use case described above is covered; a rough sketch of the idea follows below.
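
Building on the hypothetical sketch above, the proposed override could look roughly like the following. Again, all names are invented for illustration; the actual change would live in ClientRetryPolicy together with the PPAF partition-failover machinery.

```csharp
using System;

// Hypothetical illustration of the proposed partition-level override; invented
// names, not the actual SDK change.
internal static class PartitionFailoverSketch
{
    public static RetryTarget DecideRetryWithPartitionLevelOverride(
        bool isWriteRequest,
        int availableWriteRegions,
        bool isPartitionLevelFailoverEnabled,
        string partitionKeyRangeId,
        Action<string> markPartitionForFailover) // records an override for this PK range
    {
        if (isWriteRequest && availableWriteRegions <= 1 && isPartitionLevelFailoverEnabled)
        {
            // Instead of retrying the unreachable single write region indefinitely,
            // record a partition-level override so this partition key range is served
            // from the next region in the preferred list (EastUs2 in the drill setup).
            markPartitionForFailover(partitionKeyRangeId);
            return RetryTarget.NextPreferredRegion;
        }

        // Otherwise fall back to the existing behavior sketched earlier.
        return RetryDecisionSketch.DecideRetryOnHttpRequestException(isWriteRequest, availableWriteRegions);
    }
}
```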

Closing issues

To automatically close an issue: closes #4181