Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API

[Internal] ClientRetryPolicy: Fixes Partition Failover on Next region when RG Fails with `HttpRequestException` #4565

Open kundadebdatta opened 4 days ago

kundadebdatta commented 4 days ago


Description

Background:

During one of our recent backend DR drills, we found that when the primary region's routing gateway was experiencing an outage and the server and master partitions had been taken down, the .NET v3 SDK kept retrying on the same region for 20 minutes, until the primary write region came back up. This issue has been identified as intermittent and is referred to as the "hanging cosmos client" issue.

Account Setup for 3 Regions: Create a cosmos account (single master) with 3 regions: WestUs2 (Write), EastUs2 (Read) and NorthCentralUS (Read). The PPAF configuration from the BE is to fail over to EastUs2 in case WestUs2 is unavailable.

Scenario: While creating the cosmos client, provide WestUs2, EastUs2 and NorthCentralUS as the application preferred regions:

    CosmosClientOptions clientOptions = new CosmosClientOptions()
    {
        ApplicationPreferredRegions = new List<string>()
        {
            Regions.WestUS2,
            Regions.EastUS2,
            Regions.NorthCentralUS
        },
        EnablePartitionLevelFailover = true,
    };
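For reference, these options are then passed to the client constructor. A minimal sketch, assuming placeholder endpoint, key, database and container ids (not values from the actual repro):

```csharp
// Placeholder endpoint/key/ids for illustration only; the drill used a staging account.
CosmosClient cosmosClient = new CosmosClient(
    accountEndpoint: "https://<account-name>.documents.azure.com:443/",
    authKeyOrResourceToken: "<account-key>",
    clientOptions: clientOptions);

Container container = cosmosClient.GetContainer("<database-id>", "<container-id>");
```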

There were 2 different accounts (each with 100 partitions), and 40 different clients were set up in the same region (West US 2) per account, which comes to a total of 80 cosmos clients.

There is only 1 cluster involved in each region, i.e. all 200 partitions for both accounts are on westus2-be2 (and the same applies for EastUs2 and NorthCentralUS).

After initializing and warming up the cosmos clients, the cluster was brought down (all Server and Master nodes stopped) at 2024-05-09 20:30 for 8 minutes. This also brings down the Routing Gateway service in the West US 2 region.

Current Behavior: The SDK keeps retrying on the region WestUs2 to write the document (since WestUs2 is the single write region), and after retrying for 25 minutes the attempt finally succeeds once the region comes back up. This is the issue identified as the request being "hung" for 25 minutes. To understand this better, take a look at the diagnostics below and the diagnostics attached in the comments section.

Diagnostics Snippet - Scenario: Routing Gateway, along with Server and Master partitions, completely down.

```json
"Summary": {
    "DirectCalls": {
        "(503, 20006)": 1,
        "(403, 3)": 19,
        "(410, 20001)": 99,
        "(201, 0)": 1,
        "(204, 0)": 1
    },
    "GatewayCalls": {
        "(200, 0)": 2,
        "(0, 0)": 9
    }
}
```
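To show where diagnostics like the above come from, here is a minimal, hypothetical repro-style write; the item type, ids and partition key below are placeholders, not part of the drill. During the outage window the call does not fail fast, it only completes once WestUs2 recovers:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class DemoItem
{
    public string id { get; set; }
    public string pk { get; set; }
}

public static class HangRepro
{
    public static async Task WriteAsync(Container container)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();

        // During the drill, a single create like this was observed to "hang" for ~25 minutes
        // while the SDK retried against the unavailable write region.
        ItemResponse<DemoItem> response = await container.CreateItemAsync(
            new DemoItem { id = Guid.NewGuid().ToString(), pk = "demo" },
            new PartitionKey("demo"));

        Console.WriteLine($"Create returned {response.StatusCode} after {stopwatch.Elapsed}.");

        // The "Summary", "StoreResult" and "HttpResponseStats" sections quoted in this PR
        // come from this diagnostics string.
        Console.WriteLine(response.Diagnostics.ToString());
    }
}
```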

Why Did This Happen?

Note: This is applicable to write requests on a single-master write account. When the complete cluster (the nodes that host all the server partitions plus the master partition) is down due to an outage, and the routing gateway is not reachable, then after some retries in the GoneAndRetryWithRequestRetryPolicy the write request finally fails with a TransportGenerated410.

Diagnostics Snippet - Transport Stack Failure.

```json
{
    "ResponseTimeUTC": "2024-04-24T22:19:31.5361673Z",
    "ResourceType": "Document",
    "OperationType": "Create",
    "LocationEndpoint": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net/",
    "StoreResult": {
        "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559",
        "StatusCode": "Gone",
        "SubStatusCode": "TransportGenerated410",
        "LSN": -1,
        "PartitionKeyRangeId": null,
        "GlobalCommittedLSN": -1,
        "ItemLSN": -1,
        "UsingLocalLSN": false,
        "QuorumAckedLSN": -1,
        "SessionToken": null,
        "CurrentWriteQuorum": -1,
        "CurrentReplicaSetSize": -1,
        "NumberOfReadRegions": -1,
        "IsValid": false,
        "StorePhysicalAddress": "rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/apps/08a9385f-af01-40f2-bd90-10d4f17133a4/services/80c902b9-1d8f-48b2-a0fc-11a9e5bf8eed/partitions/f1f7ae05-77cd-4f52-b733-955cfb301b70/replicas/133554390203135475p/",
        "RequestCharge": 0,
        "RetryAfterInMs": null,
        "BELatencyInMs": null,
        "ReplicaHealthStatuses": [
            "(port: 14351 | status: Connected | lkt: 4/24/2024 10:09:38 PM)"
        ],
        "transportRequestTimeline": {
            "requestTimeline": [
                {
                    "event": "Created",
                    "startTimeUtc": "2024-04-24T22:19:26.9718530Z",
                    "durationInMs": 0.002
                },
                {
                    "event": "ChannelAcquisitionStarted",
                    "startTimeUtc": "2024-04-24T22:19:26.9718550Z",
                    "durationInMs": 4564.25
                },
                {
                    "event": "Failed",
                    "startTimeUtc": "2024-04-24T22:19:31.5361050Z",
                    "durationInMs": 0
                }
            ],
            "serviceEndpointStats": {
                "inflightRequests": 11,
                "openConnections": 1
            },
            "connectionStats": {
                "waitforConnectionInit": "True"
            }
        },
        "TransportException": "A client transport error occurred: The connection attempt timed out. (Time: 2024-04-24T22:19:31.5358378Z, activity ID: f5c15e36-470d-44e0-9c41-61d2132b82af, error code: ConnectTimeout [0x0006], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/, connection: -> rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/, payload sent: False)"
    }
}
```
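The sequence can be summarized in a simplified, hypothetical sketch; it is not the SDK's actual GoneAndRetryWithRequestRetryPolicy, and the delegate names are placeholders:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class GoneRetrySketch
{
    // Simplified illustration of the failure path described above (not the SDK's internal code).
    public static async Task<ResponseMessage> WriteWithGoneRetriesAsync(
        Func<Task<ResponseMessage>> writeAsync,        // direct-mode (rntbd) write attempt
        Func<Task> forceRefreshAddressesAsync,         // address refresh via the routing gateway
        int maxGoneRetries = 3)
    {
        ResponseMessage response = null;

        for (int attempt = 0; attempt < maxGoneRetries; attempt++)
        {
            response = await writeAsync();

            // 410 with sub-status 20001 (TransportGenerated410 in the diagnostics above)
            // means the replica could not be reached over the transport.
            if (response.StatusCode != HttpStatusCode.Gone)
            {
                return response;
            }

            // The SDK then forces an address refresh. With the routing gateway also down,
            // this is the call that ultimately fails with an HttpRequestException, which
            // bubbles up to the ClientRetryPolicy (next step below).
            await forceRefreshAddressesAsync();
        }

        return response; // still Gone after the retries are exhausted
    }
}
```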

The TransportGenerated410 failure triggers a force address refresh, which after 3 attempts fails with an HttpRequestException, since the gateway service itself is not reachable.

Diagnostics Snippet - HTTP Response Stats.

```json
"HttpResponseStats": [
    {
        "StartTimeUTC": "2024-04-24T22:20:57.1125860Z",
        "DurationInMs": 505.6518,
        "RequestUri": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net//addresses/?$resolveFor=dbs%2fIkltAA%3d%3d%2fcolls%2fIkltALjPvaU%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=15",
        "ResourceType": "Document",
        "HttpMethod": "GET",
        "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559",
        "ExceptionType": "System.Threading.Tasks.TaskCanceledException",
        "ExceptionMessage": "A task was canceled."
    },
    {
        "StartTimeUTC": "2024-04-24T22:20:57.6182567Z",
        "DurationInMs": 5009.6365,
        "RequestUri": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net//addresses/?$resolveFor=dbs%2fIkltAA%3d%3d%2fcolls%2fIkltALjPvaU%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=15",
        "ResourceType": "Document",
        "HttpMethod": "GET",
        "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559",
        "ExceptionType": "System.Threading.Tasks.TaskCanceledException",
        "ExceptionMessage": "A task was canceled."
    },
    {
        "StartTimeUTC": "2024-04-24T22:21:03.6319840Z",
        "DurationInMs": 13527.8359,
        "RequestUri": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net//addresses/?$resolveFor=dbs%2fIkltAA%3d%3d%2fcolls%2fIkltALjPvaU%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=15",
        "ResourceType": "Document",
        "HttpMethod": "GET",
        "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559",
        "ExceptionType": "System.Net.Http.HttpRequestException",
        "ExceptionMessage": "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net:443)"
    }
]
```

This bubbles up to the ClientRetryPolicy, which checks for the HttpRequestException and triggers another retry. Now, here is where things get a little interesting. Since the account is single master, there are no other write regions to fail over to, and the retry policy keeps retrying in the same write region, which is WestUS2. This is why we saw retries continuing indefinitely, until the primary write region came back up again.
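Reduced to a hypothetical sketch (not the SDK's actual ClientRetryPolicy; the class and method names are invented for illustration), the reason the write retries never leave WestUS2 looks like this:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: why a write on a single-master account keeps landing on the
// same region after an HttpRequestException.
public sealed class WriteRetrySketch
{
    // Regions configured on the client, in preference order.
    private readonly IReadOnlyList<string> preferredRegions =
        new[] { "West US 2", "East US 2", "North Central US" };

    // For a single-master account only the one write region is applicable for writes,
    // no matter how many read regions are configured.
    private readonly IReadOnlyList<string> applicableWriteRegions = new[] { "West US 2" };

    private int retryCount;

    public string NextRegionToRetry(bool isReadRequest)
    {
        IReadOnlyList<string> candidates = isReadRequest
            ? this.preferredRegions
            : this.applicableWriteRegions;

        // Round-robin over the applicable endpoints: reads can move on to EastUs2 or
        // NorthCentralUS, but writes cycle over a single-element list, i.e. WestUs2 forever.
        string next = candidates[this.retryCount % candidates.Count];
        this.retryCount++;
        return next;
    }
}
```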

Proposed Solution:

The purpose of this PR is to change the ClientRetryPolicy behavior to add a partition-level override when such an HttpRequestException occurs and PPAF is enabled. This makes sure the scenario described above is covered.
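A rough sketch of that direction, under the assumption of how the override hook might look; the interface and class names below are placeholders, not the SDK's internal API:

```csharp
using System;
using System.Net.Http;

// Hypothetical sketch of the proposed behavior; does not mirror the SDK's internal
// ClientRetryPolicy or its partition endpoint manager.
public interface IPartitionFailoverOverride
{
    // Returns true if subsequent attempts for this partition were rerouted to another region.
    bool TryMarkPartitionUnavailable(string partitionKeyRangeId, Uri failedRegionEndpoint);
}

public sealed class ClientRetryPolicySketch
{
    private readonly bool isPartitionLevelFailoverEnabled;
    private readonly IPartitionFailoverOverride partitionOverride;

    public ClientRetryPolicySketch(
        bool isPartitionLevelFailoverEnabled,
        IPartitionFailoverOverride partitionOverride)
    {
        this.isPartitionLevelFailoverEnabled = isPartitionLevelFailoverEnabled;
        this.partitionOverride = partitionOverride;
    }

    public bool ShouldRetryOnHttpRequestException(
        HttpRequestException exception,
        string partitionKeyRangeId,
        Uri currentRegionEndpoint)
    {
        // Existing behavior: retry, which for a single-master write means the same region.
        // Proposed addition: when PPAF is enabled, also record a partition-level override so
        // the retry is routed to the next preferred region instead of the unreachable one.
        if (this.isPartitionLevelFailoverEnabled)
        {
            this.partitionOverride.TryMarkPartitionUnavailable(partitionKeyRangeId, currentRegionEndpoint);
        }

        return true; // retry the request
    }
}
```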


Closing issues

Closes #4564

kundadebdatta commented 4 days ago

For better understanding, I have attached the diagnostics for the write request that took more than 20 mins to complete.

req_ok_30m.json

kundadebdatta commented 2 days ago

> I think we need the meeting - to me this sounds like an issue even without PPAF enabled - because at the point in question the RGW is simply down. So I don't think any change scoped to PPAF can be the right one? Or am I missing something in the repro instructions?

The only difference when PPAF is enabled vs. disabled is that there is a chance the faulty partition will be failed over, so the retry might succeed. Without PPAF, there is no way that partition could fail over.