Open kundadebdatta opened 4 days ago
For better understanding, I have attached the diagnostics for the write request that took more than 20 mins to complete.
I think we need the meeting - to me this sounds like an issue even without PPAF enabled, because at the point in question the RGW is simply down. So I don't think scoping any change to PPAF can be the right fix? Or am I missing something in the repro instructions?
The only difference when PPAF is enabled vs disabled is that there is a chance the faulty partition will be failed over, so the retry might be successful; without PPAF, there is no way that partition could fail over.
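The difference described above can be sketched as a small model (illustrative only, not SDK code): without PPAF, retries against a single-master account can only succeed once the write region itself recovers, while a PPAF-driven partition failover can let a retry succeed earlier.

```python
def retry_until_success(outage_ticks, ppaf_failover_tick=None, max_ticks=100):
    """Count failed retries until a write succeeds.

    outage_ticks: how long the single write region stays down.
    ppaf_failover_tick: tick at which PPAF fails the partition over to
        another region (None = PPAF disabled / no failover happens).
    """
    for tick in range(max_ticks):
        region_up = tick >= outage_ticks
        failed_over = ppaf_failover_tick is not None and tick >= ppaf_failover_tick
        if region_up or failed_over:
            return tick  # number of failed retries before success
    return max_ticks

# Without PPAF the client retries for the full outage (the "hang"):
assert retry_until_success(outage_ticks=25) == 25
# With PPAF, a failover at tick 3 lets a retry succeed much earlier:
assert retry_until_success(outage_ticks=25, ppaf_failover_tick=3) == 3
```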
Pull Request Template
Description
Background:
During one of our recent backend DR drills, it was found that when the primary region's routing gateway is experiencing an outage and the server and master partitions were taken down, the .NET v3 SDK kept retrying on the same region for 20 minutes, until the primary write region came back up. This issue has been identified as intermittent and titled the "hanging cosmos client".

Account Setup (3 regions): Create a Cosmos account (single master) with 3 regions: WestUs2 (Write), EastUs2 (Read) and NorthCentralUS (Read). The PPAF configuration from the BE is to fail over to EastUs2 in case WestUs2 is unavailable.

Scenario: While creating the Cosmos client, provide WestUs2, EastUs2 and NorthCentralUS as the application's preferred regions.
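A minimal routing model of this setup (names and functions are illustrative, not the SDK's API): for a single-master account, writes always resolve to the one write region regardless of the preferred-region list, while reads fall back through the preferred regions in order.

```python
# Hypothetical routing model for the account described above.
PREFERRED_REGIONS = ["WestUs2", "EastUs2", "NorthCentralUS"]
WRITE_REGION = "WestUs2"  # single-master account: only one write region

def resolve_region(operation, available):
    """Pick the region an operation is routed to, given available regions."""
    if operation == "write":
        # Writes must target the single write region, even if it is down.
        return WRITE_REGION
    # Reads fall back through the preferred-region list in order.
    for region in PREFERRED_REGIONS:
        if region in available:
            return region
    return None

# While WestUs2 is down, writes are still pinned to it; reads fail over.
assert resolve_region("write", {"EastUs2", "NorthCentralUS"}) == "WestUs2"
assert resolve_region("read", {"EastUs2", "NorthCentralUS"}) == "EastUs2"
```

This is why the outage below only hangs write requests: reads have somewhere else to go.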
There were 2 different accounts (each with 100 partitions), and 40 different clients were set up in the same region (West US 2) per account, which boils down to a total of 80 Cosmos clients. There is only 1 cluster involved in each region, i.e. all 200 partitions for both accounts are on `westus2-be2` (and the same for `EastUs2` and `NorthCentralUS`).

After initializing and warming up the Cosmos client, the cluster was brought down (all Server + Master nodes stopped) at 2024-05-09 20:30 for 8 minutes. This also brings down the Routing Gateway service in the West US 2 region.

Current Behavior: The SDK keeps retrying on the region WestUs2 for writing the document (since WestUs2 is the single write region), and after retrying for 25 minutes, the attempt finally succeeds once the region comes back up. This is the issue identified as the request being "hung" for 25 minutes. To understand this better, take a look at the diagnostics below and the diagnostics attached in the comment section.

Diagnostics Snippet - Scenario: Routing Gateway, along with Server and Master partitions, are completely down.
```json "Summary": { "DirectCalls": { "(503, 20006)": 1, "(403, 3)": 19, "(410, 20001)": 99, "(201, 0)": 1, "(204, 0)": 1 }, "GatewayCalls": { "(200, 0)": 2, "(0, 0)": 9 } } ```Why Did This Happened?
Note: This is applicable for write requests in a single master write account. When the complete cluster (nodes, that hosts all the server partitions + master partition) is down due to an outage, and the routing gateway is not reachable, after some retries in the
GoneAndRetryWithRequestRetryPolicy
the write request finally fails with aTransportGenerated410
.Diagnostics Snippet - Transport Stack Failure.
```json { "ResponseTimeUTC": "2024-04-24T22:19:31.5361673Z", "ResourceType": "Document", "OperationType": "Create", "LocationEndpoint": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net/", "StoreResult": { "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559", "StatusCode": "Gone", "SubStatusCode": "TransportGenerated410", "LSN": -1, "PartitionKeyRangeId": null, "GlobalCommittedLSN": -1, "ItemLSN": -1, "UsingLocalLSN": false, "QuorumAckedLSN": -1, "SessionToken": null, "CurrentWriteQuorum": -1, "CurrentReplicaSetSize": -1, "NumberOfReadRegions": -1, "IsValid": false, "StorePhysicalAddress": "rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/apps/08a9385f-af01-40f2-bd90-10d4f17133a4/services/80c902b9-1d8f-48b2-a0fc-11a9e5bf8eed/partitions/f1f7ae05-77cd-4f52-b733-955cfb301b70/replicas/133554390203135475p/", "RequestCharge": 0, "RetryAfterInMs": null, "BELatencyInMs": null, "ReplicaHealthStatuses": [ "(port: 14351 | status: Connected | lkt: 4/24/2024 10:09:38 PM)" ], "transportRequestTimeline": { "requestTimeline": [ { "event": "Created", "startTimeUtc": "2024-04-24T22:19:26.9718530Z", "durationInMs": 0.002 }, { "event": "ChannelAcquisitionStarted", "startTimeUtc": "2024-04-24T22:19:26.9718550Z", "durationInMs": 4564.25 }, { "event": "Failed", "startTimeUtc": "2024-04-24T22:19:31.5361050Z", "durationInMs": 0 } ], "serviceEndpointStats": { "inflightRequests": 11, "openConnections": 1 }, "connectionStats": { "waitforConnectionInit": "True" } }, "TransportException": "A client transport error occurred: The connection attempt timed out. (Time: 2024-04-24T22:19:31.5358378Z, activity ID: f5c15e36-470d-44e0-9c41-61d2132b82af, error code: ConnectTimeout [0x0006], base error: HRESULT 0x80131500, URI: rntbd://cdb-ms-stage-westus2-be2.documents-staging.windows-ppe.net:14351/, connection:This triggers a force address refresh, which after
3
attempts fails with aHttpRequestException
since the gateway service itself is not rechable.Diagnostics Snippet - HTTP Response Stack.
```json "HttpResponseStats": [ { "StartTimeUTC": "2024-04-24T22:20:57.1125860Z", "DurationInMs": 505.6518, "RequestUri": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net//addresses/?$resolveFor=dbs%2fIkltAA%3d%3d%2fcolls%2fIkltALjPvaU%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=15", "ResourceType": "Document", "HttpMethod": "GET", "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559", "ExceptionType": "System.Threading.Tasks.TaskCanceledException", "ExceptionMessage": "A task was canceled." }, { "StartTimeUTC": "2024-04-24T22:20:57.6182567Z", "DurationInMs": 5009.6365, "RequestUri": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net//addresses/?$resolveFor=dbs%2fIkltAA%3d%3d%2fcolls%2fIkltALjPvaU%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=15", "ResourceType": "Document", "HttpMethod": "GET", "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559", "ExceptionType": "System.Threading.Tasks.TaskCanceledException", "ExceptionMessage": "A task was canceled." }, { "StartTimeUTC": "2024-04-24T22:21:03.6319840Z", "DurationInMs": 13527.8359, "RequestUri": "https://mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net//addresses/?$resolveFor=dbs%2fIkltAA%3d%3d%2fcolls%2fIkltALjPvaU%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=15", "ResourceType": "Document", "HttpMethod": "GET", "ActivityId": "c3f0c0fa-f198-4eea-95c8-adbc378b8559", "ExceptionType": "System.Net.Http.HttpRequestException", "ExceptionMessage": "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (mmankos-0404-ppaf-3-westus2.documents-staging.windows-ppe.net:443)" } ] ```This bubbles up to the
ClientRetryPolicy
which checks for theHttpRequestException
and triggeres another retry. Now, here is where the things get a little interesting. Since, the account is a single master, there are no write regions to fail over and the retry policy keep retrying in the same write region, which is theWestUS2
. This is the reason we saw the indefinite amount of retries, until the primary write region came back up again.Proposed Solution:
The purpose of this PR is to change the `ClientRetryPolicy` behavior to add a partition-level override when such an `HttpRequestException` happens and PPAF is enabled. This will make sure the below use case is covered:

Type of change
Please delete options that are not relevant.
Closing issues
To automatically close an issue: closes #4564
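For reference, the retry chain described in this PR can be condensed into a small model (illustrative names only, not the actual SDK classes): a transport 410 triggers a force address refresh; when the gateway is also down, the refresh fails and the retry policy retries the same (single) write region, unless a PPAF partition-level override supplies an alternative region.

```python
class GatewayDown(Exception):
    """Stands in for the HttpRequestException from the failed address refresh."""

def force_address_refresh():
    # Routing gateway is unreachable, so every refresh attempt fails.
    raise GatewayDown()

def client_retry_policy(regions_tried, ppaf_override_region=None):
    write_region = "WestUs2"  # single-master account: the only write region
    try:
        force_address_refresh()
    except GatewayDown:
        if ppaf_override_region is not None:
            # Proposed behavior: the PPAF partition-level override
            # redirects the retry to another region.
            regions_tried.append(ppaf_override_region)
        else:
            # Current behavior: the retry lands on the same region again.
            regions_tried.append(write_region)

# Current behavior: every retry goes back to WestUs2 (the "hang").
tried = []
for _ in range(3):
    client_retry_policy(tried)
assert tried == ["WestUs2", "WestUs2", "WestUs2"]

# Proposed behavior: with PPAF enabled, the override redirects the retry.
tried = []
client_retry_policy(tried, ppaf_override_region="EastUs2")
assert tried == ["EastUs2"]
```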