Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License

FeedIterator ReadNextAsync is getting ServiceUnavailable (503); Substatus: 20001 after upgrading cosmos sdk nuget #4337

Closed: oshvartz closed this issue 6 months ago

oshvartz commented 8 months ago

We are continuously addressing and improving the SDK; if possible, make sure the problem persists in the latest SDK version.

Describe the bug
After updating the .NET SDK from 3.31.1 to 3.35.4 we started getting ServiceUnavailable (503); Substatus: 20001 on FeedIterator ReadNextAsync.

To Reproduce
It happens randomly, and only for calls from a server in Azure East US to a Cosmos account in Southeast Asia.

Expected behavior
No error is thrown and the next results page is retrieved.

Actual behavior
Throws ServiceUnavailable (503); Substatus: 20001; ActivityId: 936c91da-9f89-4a18-96de-740cd00ee62b; Reason: (The request failed because the client was unable to establish connections to 4 endpoints across 1 regions. Please check for client resource starvation issues and verify connectivity between client and server.)

Environment summary
SDK Version: 3.35.4
OS Version: Windows
Framework: .NET 6

Additional context
- Verified SNAT ports; looking at the diagnostics we see no thread starvation and no high CPU.
- Using Direct mode and a TokenCredential to authenticate (sketched below).
- The Cosmos account has only one region defined.
- The diagnostics show DirectCalls":["(410, 20001)":5, so it seems we are getting status Gone (410).
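For reference, a minimal sketch of the client setup described above (Direct mode plus a TokenCredential); the endpoint and credential type are placeholders, not values from this issue:

```csharp
using Azure.Identity;
using Microsoft.Azure.Cosmos;

// Placeholder endpoint; the real account name is masked in this issue.
CosmosClient client = new CosmosClient(
    "https://my-account.documents.azure.com:443/",
    new DefaultAzureCredential(),
    new CosmosClientOptions
    {
        // Matches the "ConnectionMode":"Direct" entry in the diagnostics below.
        ConnectionMode = ConnectionMode.Direct
    });
```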

ealsur commented 8 months ago

@oshvartz Can you please share the full exception/diagnostics?

oshvartz commented 8 months ago

some of it PII so I will mask it Response status code does not indicate success: ServiceUnavailable (503); Substatus: 20001; ActivityId: 936c91da-9f89-4a18-96de-740cd00ee62b; Reason: (The request failed because the client was unable to establish connections to 4 endpoints across 1 regions. Please check for client resource starvation issues and verify connectivity between client and server. More info: https://aka.ms/cosmosdb-tsg-service-unavailable\r\nActivityId: 936c91da-9f89-4a18-96de-740cd00ee62b, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Windows/10.0.20348 cosmos-netstandard-sdk/3.32.0);; Diagnostics:[\"Summary\":[\"DirectCalls\":[\"(410, 20001)\":5],\"GatewayCalls\":[\"(200, 0)\":2]],\"name\":\"Typed FeedIterator ReadNextAsync\",\"start datetime\":\"2024-03-04T12:11:26.133Z\",\"duration in milliseconds\":34776.1538,\"data\":[\"Client Configuration\":[\"Client Created Time Utc\":\"2024-03-04T10:42:16.0517487Z\",\"MachineId\":\"vmId:34250944-e978-46b5-92c0-40c70200594f\",\"VM Region\":\"eastus\",\"NumberOfClientsCreated\":20,\"NumberOfActiveClients\":20,\"ConnectionMode\":\"Direct\",\"User Agent\":\"cosmos-netstandard-sdk/3.38.1|10|X64|Microsoft Windows 10.0.20348|.NET 6.0.2|N|\",\"ConnectionConfig\":[\"gw\":\"(cps:50, urto:6, p:False, httpf: False)\",\"rntbd\":\"(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)\",\"other\":\"(ed:False, be:False)\"],\"ConsistencyConfig\":\"(consistency: NotSet, prgns:[], apprgn: )\",\"ProcessorCount\":16],\"Query Correlated ActivityId\":\"fb63afa0-c3b2-41c4-9050-49e1242dc503\"],\"children\":[[\"name\":\"Create Query Pipeline\",\"duration in milliseconds\":0.0985,\"children\":[[\"name\":\"Get Container Properties\",\"duration in milliseconds\":0.0157,\"children\":[[\"name\":\"Get Collection Cache\",\"duration in milliseconds\":0.0013]]]]],[\"name\":\"Get Partition Key Ranges\",\"duration in milliseconds\":0.0358,\"children\":[[\"name\":\"Try Get Overlapping Ranges\",\"duration in milliseconds\":0.021]]],[\"name\":\"[,FF) move next\",\"duration in milliseconds\":34775.6354,\"children\":[[\"name\":\"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler\",\"duration in milliseconds\":34775.2689,\"children\":[[\"name\":\"Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler\",\"duration in milliseconds\":34775.2349,\"data\":[\"System 
Info\":[\"systemHistory\":[[\"dateUtc\":\"2024-03-04T12:11:02.0148100Z\",\"cpu\":16.888,\"memory\":27545132.000,\"threadInfo\":[\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0554,\"availableThreads\":32766,\"minThreads\":16,\"maxThreads\":32767],\"numberOfOpenTcpConnection\":294],[\"dateUtc\":\"2024-03-04T12:11:12.0305392Z\",\"cpu\":15.610,\"memory\":27786328.000,\"threadInfo\":[\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0536,\"availableThreads\":32762,\"minThreads\":16,\"maxThreads\":32767],\"numberOfOpenTcpConnection\":294],[\"dateUtc\":\"2024-03-04T12:11:22.0461804Z\",\"cpu\":17.853,\"memory\":27739792.000,\"threadInfo\":[\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0249,\"availableThreads\":32766,\"minThreads\":16,\"maxThreads\":32767],\"numberOfOpenTcpConnection\":294],[\"dateUtc\":\"2024-03-04T12:11:32.0617628Z\",\"cpu\":22.943,\"memory\":27695196.000,\"threadInfo\":[\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0501,\"availableThreads\":32766,\"minThreads\":16,\"maxThreads\":32767],\"numberOfOpenTcpConnection\":294],[\"dateUtc\":\"2024-03-04T12:11:42.0770009Z\",\"cpu\":21.402,\"memory\":27630724.000,\"threadInfo\":[\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.2541,\"availableThreads\":32765,\"minThreads\":16,\"maxThreads\":32767],\"numberOfOpenTcpConnection\":294],[\"dateUtc\":\"2024-03-04T12:11:52.0917364Z\",\"cpu\":26.024,\"memory\":27632252.000,\"threadInfo\":[\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0977,\"availableThreads\":32766,\"minThreads\":16,\"maxThreads\":32767],\"numberOfOpenTcpConnection\":294]]]],\"children\":[[\"name\":\"Microsoft.Azure.Cosmos.Handlers.TelemetryHandler\",\"duration in milliseconds\":34775.228,\"children\":[[\"name\":\"Microsoft.Azure.Cosmos.Handlers.RetryHandler\",\"duration in milliseconds\":34775.2217,\"children\":[[\"name\":\"Microsoft.Azure.Cosmos.Handlers.RouterHandler\",\"duration in milliseconds\":34775.1692,\"children\":[[\"name\":\"Microsoft.Azure.Cosmos.Handlers.TransportHandler\",\"duration in milliseconds\":34775.1656,\"children\":[[\"name\":\"Microsoft.Azure.Documents.ServerStoreModel Transport Request\",\"duration in milliseconds\":34774.248,\"data\":[\"Client Side Request Stats\":[\"Id\":\"AggregatedClientSideRequestStatistics\",\"ContactedReplicas\":[],\"RegionsContacted\":[],\"FailedReplicas\":[\"rntbd://**8:14017/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133490440942662799p\",\"rntbd://cdb-ms-prod-southeastasia1-fd13-az03.documents.azure.com:14050/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133525767419518240s\",\"rntbd://cdb-ms-prod-southeastasia1-fd13-az01.documents.azure.com:14332/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133525446851132963s\",\"rntbd://cdb-ms-prod-southeastasia1-fd13-az02.documents.azure.com:14049/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133525466106987757s\"],\"ForceAddressRefresh\":[[\"No change to 
cache\":[\"rntbd://cdb-ms-prod-southeastasia1-fd13-az03.documents.azure.com:14017/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133490440942662799p\",\"rntbd://cdb-ms-prod-southeastasia1-fd13-az01.documents.azure.com:14332/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133525446851132963s\",\"rntbd://cdb-ms-prod-southeastasia1-fd13-az02.documents.azure.com:14049/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133525466106987757s\",\"rntbd://cdb-ms-prod-southeastasia1-fd13-az03.documents.azure.com:14050/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133525767419518240s\"]],[\"No change to cache\":[\"rntbd://cdb-ms-prod-southeastasia1-fd13-az03.documents.azure.com:14017/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133490440942662799p\",\"rntbd://cdb-ms-prod-southeastasia1-fd13-az01.documents.azure.com:14332/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133525446851132963s\",\"rntbd://cdb-ms-prod-southeastasia1-fd13-az02.documents.azure.com:14049/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133525466106987757s\",\"rntbd://cdb-ms-prod-southeastasia1-fd13-az03.documents.azure.com:14050/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133525767419518240s\"]]],\"AddressResolutionStatistics\":[[\"StartTimeUTC\":\"2024-03-04T12:11:32.9063699Z\",\"EndTimeUTC\":\"2024-03-04T12:11:33.1339908Z\",\"TargetEndpoint\":\"https://geo-asia-asisf-prod-db.documents.azure.com//addresses/?$resolveFor=dbs%2f8iMIAA%3d%3d%2fcolls%2f8iMIAKYo9Hc%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0\"],[\"StartTimeUTC\":\"2024-03-04T12:11:53.9059232Z\",\"EndTimeUTC\":\"2024-03-04T12:11:54.1312003Z\",\"TargetEndpoint\":\"https://*****//addresses/?$resolveFor=dbs%2f8iMIAA%3d%3d%2fcolls%2f8iMIAKYo9Hc%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0\"]],\"StoreResponseStatistics\":[[\"ResponseTimeUTC\":\"2024-03-04T12:11:32.9060438Z\",\"ResourceType\":\"Document\",\"OperationType\":\"Query\",\"RequestSessionToken\":\"0:15#6621133\",\"LocationEndpoint\":\"https://*/\",\"StoreResult\":[\"ActivityId\":\"936c91da-9f89-4a18-96de-740cd00ee62b\",\"StatusCode\":\"Gone\",\"SubStatusCode\":\"TransportGenerated410\",\"LSN\":-1,\"PartitionKeyRangeId\":null,\"GlobalCommittedLSN\":-1,\"ItemLSN\":-1,\"UsingLocalLSN\":true,\"QuorumAckedLSN\":-1,\"SessionToken\":null,\"CurrentWriteQuorum\":-1,\"CurrentReplicaSetSize\":-1,\"NumberOfReadRegions\":-1,\"IsValid\":false,\"StorePhysicalAddress\":\"rntbd://cdb-ms-prod-southeastasia1-fd13-az03.documents.azure.com:14017/apps/75084cf5-ad02-4e67-94ce-893215572b76/services/1a857c93-bffc-46d2-ad92-dc77354b8dd8/partitions/ce520ef0-bbaf-42d9-b2a5-ac15bce6ba96/replicas/133490440942662799p\",\"RequestCharge\":0,\"RetryAfterInMs\":null,\"BELatencyInMs\":null,\"ReplicaHealthStatuses\":[\"(port: 14017 | status: 
UnhealthyPending | lkt: 3/4/2024 12:10:31 PM)\",\"(port: 14332 | status: UnhealthyPending | lkt: 3/4/2024 12:10:31 PM)\",\"(port: 14049 | status: UnhealthyPending | lkt: 3/4/2024 12:10:31 PM)\",\"(port: 14050 | status: UnhealthyPending | lkt: 3/4/2024 12:10:31 PM)\"],\"transportRequestTimeline\":[\"requestTimeline\":[[\"event\": \"Created\", \"startTimeUtc\": \"2024-03-04T12:11:26.1337653Z\", \"durationInMs\": 0.0059],[\"event\": \"ChannelAcquisitionStarted\", \"startTimeUtc\": \"2024-03-04T12:11:26.1337712Z\", \"durationInMs\": 0.0209],[\"event\": \"Pipelined\", \"startTimeUtc\": \"2024-03-04T12:11:26.1337921Z\", \"durationInMs\": 0.1109],[\"event\": \"Transit Time\", \"startTimeUtc\": \"2024-03-04T12:11:26.1339030Z\", \"durationInMs\": 6771.9708],[\"event\": \"Failed\", \"startTimeUtc\": \"2024-03-04T12:11:32.9058738Z\", \"durationInMs\": 0]],\"serviceEndpointStats\":[\"inflightRequests\":1,\"openConnections\":1],\"connectionStats\":[\"waitforConnectionInit\":\"False\",\"callsPendingReceive\":0,\"lastSendAttempt\":\"2024-03-04T12:10:42.3068110Z\",\"lastSend\":\"2024-03-04T12:10:42.3068546Z\",\"lastReceive\":\"2024-03-04T12:10:42.5291730Z\"],\"requestSi"

ealsur commented 8 months ago

Have you looked into the aka.ms link in the exception? https://aka.ms/cosmosdb-tsg-service-unavailable

"NumberOfClientsCreated":20,"NumberOfActiveClients":20

There are 20 client instances created, is this by design? Is the application working with 20 different Cosmos DB accounts? Reference: https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-dotnet-sdk-request-timeout?tabs=cpu-new#create-multiple-client-instances
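Not part of the thread, but a minimal sketch of the singleton-client pattern the linked article recommends, assuming one account per client (the endpoint is a placeholder):

```csharp
using Azure.Identity;
using Microsoft.Azure.Cosmos;

public static class CosmosClientFactory
{
    // One CosmosClient per account for the lifetime of the application,
    // instead of creating a new client per request or per component.
    private static readonly CosmosClient Client = new CosmosClient(
        "https://my-account.documents.azure.com:443/", // placeholder endpoint
        new DefaultAzureCredential());

    public static CosmosClient Instance => Client;
}
```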

server on azure east us calling to cosmos account on south east asia

Network traffic between regions goes through unmonitored network infrastructure; it can face network problems, high latency, and timeouts. Has this always been the deployment topology?

After updating .net sdk from 3.31.1 to 3.35.4

Is the SDK update the only change that was made? If so, does rolling it back solve the issue? The error is network related, unless there were other code changes beyond the SDK that might be playing a part (such as the increase in client instances).

ealsur commented 8 months ago

To clarify, nothing in the exception diagnostics shows the SDK is doing anything wrong. It is facing timeouts, retrying, and attempting to rediscover addresses.

oshvartz commented 8 months ago

@ealsur just a few answers:
- We did try following the TSG.
- We have two application versions: one using SDK 3.31.1 and one with Microsoft.Azure.Cosmos upgraded to 3.35.4. Only the upgraded one shows the issue; rolling back makes the issue disappear.
- It's happening on all VMs in the region and only for this account; maybe some configuration/timeout was changed?
- It's very reproducible: we get ~60-70 errors per hour.
- Can you advise what we can do to overcome this issue?
- One thing we did change is using PartitionKeyBuilder.Add(partition value) instead of new PartitionKey(partition value) when creating the QueryRequestOptions (see the sketch below). Could that impact performance?
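For clarity, a minimal sketch of the two ways of specifying the partition key mentioned above; the value is a placeholder, not taken from the issue:

```csharp
using Microsoft.Azure.Cosmos;

// Before the change: single-value partition key.
QueryRequestOptions before = new QueryRequestOptions
{
    PartitionKey = new PartitionKey("some-partition-value")
};

// After the change: PartitionKeyBuilder, which is intended for hierarchical
// partition keys but can also build a single-value key.
QueryRequestOptions after = new QueryRequestOptions
{
    PartitionKey = new PartitionKeyBuilder()
        .Add("some-partition-value")
        .Build()
};
```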

ealsur commented 8 months ago

Is the issue only affecting the queries that now use PartitionKeyBuilder? Are other queries working well? Are the queries with PartitionKeyBuilder working sometimes too?

What about the Number of Clients? There are 20, is this expected?

One potential reason: queries with a hierarchical partition key (PartitionKeyBuilder) can be cross-partition queries. Cross-partition queries use more connections than a single-partition query, and 20 clients on the machine might be putting some pressure on resources.

Is it possible to attach the rest of the diagnostics?

oshvartz commented 8 months ago

@ealsur we have ~20 different Cosmos accounts in the process; is the number of clients counted for the same account? The issue is for a document with only one partition key, and we pass it as part of the query, so it is not cross-partition.

Is it possible to attach the rest of the diagnostics? - this was truncated in my logs, so that is all I have too.

Update: the same code with 3.31.2-preview didn't reproduce the error. Trying with 3.33.0 and I'll update.

ealsur commented 8 months ago

with only one partition key,

Are you using PartitionKeyBuilder and setting only one value? PartitionKeyBuilder is meant for hierarchical partition keys. With hierarchical partition keys, when the number of values supplied is less than the number of paths defined on the container (for example, the container is created with /country, /state, /city and you use PartitionKeyBuilder specifying only country), the query can be a cross-partition query.
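A minimal sketch of that prefix scenario; the container paths, item shape, and query text are hypothetical, not from this issue:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical container created with hierarchical partition key paths
// /country, /state, /city.
static async Task QueryByCountryPrefixAsync(Container container)
{
    PartitionKey prefix = new PartitionKeyBuilder()
        .Add("Germany") // only /country is supplied; /state and /city are omitted
        .Build();

    // Because the key is only a prefix of the full hierarchy, the query
    // can target several physical partitions rather than a single one.
    FeedIterator<dynamic> iterator = container.GetItemQueryIterator<dynamic>(
        "SELECT * FROM c",
        requestOptions: new QueryRequestOptions { PartitionKey = prefix });

    while (iterator.HasMoreResults)
    {
        FeedResponse<dynamic> page = await iterator.ReadNextAsync();
        Console.WriteLine($"Got {page.Count} items");
    }
}
```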

ealsur commented 8 months ago
["event": "Transit Time", "startTimeUtc": "2024-03-04T12:11:26.1339030Z", "durationInMs": 6771.9708],["event": "Failed", "startTimeUtc": "2024-03-04T12:11:32.9058738Z", "durationInMs": 0]],

This means the SDK sent the request and the response was not returned in 6700 milliseconds, so it timed out.

The VM is in East US

,"VM Region":"eastus"

And the Cosmos DB region is in South east asia

rntbd://cdb-ms-prod-southeastasia1-fd13-az01.documents.azure.com:14332

There isn't much more we can do to analyze this on the SDK end. Your machine is doing a cross-continental network request and it is timing out. Whatever networks sit in between East US and Southeast Asia are public networks owned by unknown companies.

oshvartz commented 7 months ago

Updating that it may be related to #4360, since we were on the edge of the dynamic port range - I will update.

microsoft-github-policy-service[bot] commented 7 months ago

@oshvartz this issue requires more information for the team to be able to help. In case this information is available, please add it and re-open the Issue.

oshvartz commented 6 months ago

@ealsur it seems I cannot reopen the issue, but I still wanted to complete the data on it. Sorry for the long delay in responding; we had a mitigation (downgrading to version 3.33.0) so we postponed investigating it.

We thought that #4360 (dynamic ports) was related, but we reproduced the problem in an isolated environment and saw it was still happening. We did a binary search on the SDK version and saw it started with 3.35.0, so we reviewed the change log and found #3902, which is part of 3.35.0: the default timeout was reduced from 10 to 6 seconds. We increased the timeout back to 10 manually and the issue was resolved (see the sketch below).

Our feedback is that the error indicated a connectivity issue (20001, not 20002, the client timeout), so we followed the relevant TSG, which talks about ports/threads/high CPU on the client. I think that maybe the 20002 substatus should be returned in this case. In addition, I looked for an explanation of why the default timeout was reduced to 6 seconds. How was this value selected? Thanks
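A minimal sketch of raising the timeout back to 10 seconds. This assumes the value in question maps to CosmosClientOptions.RequestTimeout; the issue does not show the exact setting the author changed:

```csharp
using System;
using Microsoft.Azure.Cosmos;

CosmosClientOptions options = new CosmosClientOptions
{
    ConnectionMode = ConnectionMode.Direct,
    // Assumed mapping: restore the pre-3.35.0 default of 10 seconds.
    RequestTimeout = TimeSpan.FromSeconds(10)
};
```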

ealsur commented 6 months ago

Both 20001 and 20002 point to the same problem; they just happen at different times. 20001 is related to the inability to establish connections, 20002 is related to experiencing timeouts during requests: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/troubleshoot-service-unavailable. In your case the attached diagnostics are truncated, so I cannot see whether the majority of the timeouts were due to connection establishment or transit.
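Not part of the thread, but a minimal sketch of capturing the full, untruncated diagnostics and substatus from a failed call, which is what this kind of investigation needs; the method and iterator names are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

static async Task<FeedResponse<dynamic>> ReadPageWithDiagnosticsAsync(FeedIterator<dynamic> iterator)
{
    try
    {
        return await iterator.ReadNextAsync();
    }
    catch (CosmosException ex)
    {
        // StatusCode/SubStatusCode distinguish 20001 from 20002; Diagnostics
        // holds the full trace that was truncated in the logs above.
        Console.WriteLine($"Status: {(int)ex.StatusCode}, SubStatus: {ex.SubStatusCode}");
        Console.WriteLine(ex.Diagnostics.ToString());
        throw;
    }
}
```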

The TSG steps are the same for both; at least I don't see a difference in the document. Both can have the same root causes.

The previous timeout was 10 seconds, the new one is 6 seconds. But the truth is that a request should not take that long to complete; having 7 seconds of latency on requests is not normal. The 6-second value was chosen to be in line with other SDKs (like Java) and because Cosmos DB has a maximum processing time of 5 seconds: assuming the worst-case scenario where the request takes exactly 5 seconds on the service, there is still 1 full second of network time.

Do you normally work with more than 6 seconds of network latency, and do you consider that normal behavior?

microsoft-github-policy-service[bot] commented 6 months ago

@oshvartz this issue requires more information for the team to be able to help. In case this information is available, please add it and re-open the Issue.