Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License

Increasing TCP connections using direct mode #4360

Closed oshvartz closed 5 months ago

oshvartz commented 5 months ago

Describe the bug
We have a process connected to multiple Cosmos accounts. We noticed that the number of TCP connections keeps increasing. Restarting the process reduces only part of the connections, and the count then continues to grow. We noticed that most of the connections were to the same IP address, 20.15.14.6, for which we could not resolve a host name, but based on the ports it uses we suspect it is one of the Cosmos DB replicas. This is happening only in one of our environments, in region EUS, which connects to accounts in regions such as South Asia and Australia.

Expected behavior
Previously the TCP connection count was stable, increasing and decreasing by a few tens, but now it only increases and has risen from ~6K to ~10K connections.

Environment summary
SDK Version: 3.33.0 (we tried upgrading to 3.38.1 without any impact)
OS Version: Windows Server

Is there any metric/diagnostics we can enable to understand what the SDK's connection status inside the process is?

ealsur commented 5 months ago

Yes, the Diagnostics will tell you the established TCP connections and other stats such as the number of client instances. Capture them when you see the scenarios of high connection count.
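For illustration, a minimal sketch of capturing Diagnostics from a query response (the container and the MyDocument type are placeholders for your own container and model):

```csharp
// Hedged sketch: reading CosmosDiagnostics from a query's FeedResponse.
// "container" and MyDocument are placeholders, not names from this thread.
FeedIterator<MyDocument> iterator = container.GetItemQueryIterator<MyDocument>("SELECT * FROM c");
while (iterator.HasMoreResults)
{
    FeedResponse<MyDocument> page = await iterator.ReadNextAsync();
    // Diagnostics include established TCP connection counts and client instance stats
    Console.WriteLine(page.Diagnostics.ToString());
}
```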

oshvartz commented 5 months ago

@ealsur thanks for your response. Can you help me identify the host behind the IP 20.15.14.6, or how can I find the IP addresses of my Cosmos account replicas? As for Diagnostics, I think they are only available on the query response, and our process only uses change feed (it subscribes on multiple accounts), so how can we access Diagnostics in this case? In addition, the connection count remains high even after a restart and it's 100 times more than the number of accounts - does that make sense?

ealsur commented 5 months ago

I confirmed the IP belongs to a Cosmos DB endpoint.

Can you share full Diagnostics? Based on the Direct connectivity design, the volume of connections is dependent on multiple factors: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/sdk-connection-modes#volume-of-connections

Within a single CosmosClient instance the stable number is 4 times the number of physical partitions in the Container. So, the more containers, the more connections (in general). The number of connections can also increase based on concurrent requests.

Now, if you have multiple clients, then this multiplies (that is why we recommend a singleton per account). If your application interacts with 20 different accounts, then the Diagnostics should show 20 client instances; if it shows more, that means you are leaking clients, which will lead to more connections than expected. Otherwise, if the volume of clients matches the expectation (== accounts), then the volume of connections depends on the workload and Containers. As per the above link, you can control it with settings like IdleTcpConnectionTimeout for bursty scenarios where the volume of concurrent requests spikes, but in a constant workload those connections might not close because they are constantly being used. At that point, it's a matter of understanding whether the architecture is correct; maybe Direct mode cannot be used with your current limits.
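As an illustration, a minimal sketch of keeping one CosmosClient per account endpoint (the cache class and method names are hypothetical; this is one pattern among several):

```csharp
using System.Collections.Concurrent;
using Azure.Core;
using Microsoft.Azure.Cosmos;

// Hedged sketch: one CosmosClient singleton per account endpoint.
public static class CosmosClientCache
{
    private static readonly ConcurrentDictionary<string, CosmosClient> clientsByAccount = new();

    // GetOrAdd keeps a single client per account for the lifetime of the process.
    public static CosmosClient GetClientForAccount(string accountEndpoint, TokenCredential credential) =>
        clientsByAccount.GetOrAdd(accountEndpoint, endpoint => new CosmosClient(endpoint, credential));
}
```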

Diagnostics are on all operations. Queries return a FeedResponse, which has it. The Change Feed Pull model also returns a FeedResponse. The Change Feed Processor exposes a Context in the handler that also has it.
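For the Change Feed Processor case, a minimal sketch of logging Diagnostics from the handler context (monitoredContainer, leaseContainer, MyDocument, and the processor/instance names are placeholders):

```csharp
// Hedged sketch: capturing CosmosDiagnostics from the Change Feed Processor handler.
ChangeFeedProcessor processor = monitoredContainer
    .GetChangeFeedProcessorBuilder<MyDocument>(
        processorName: "diagnosticsProcessor",
        onChangesDelegate: (ChangeFeedProcessorContext context, IReadOnlyCollection<MyDocument> changes, CancellationToken cancellationToken) =>
        {
            // The context Diagnostics include TCP connection stats and client instance counts
            Console.WriteLine(context.Diagnostics.ToString());
            return Task.CompletedTask;
        })
    .WithInstanceName("workerInstance")
    .WithLeaseContainer(leaseContainer)
    .Build();

await processor.StartAsync();
```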

oshvartz commented 5 months ago

@ealsur thanks for the detailed response. I will extract diagnostics data using the processing context and will update. I should have only one client per account, but I will verify this. The largest collection I have has 10 physical partitions, so the numbers don't make sense - I will check.

oshvartz commented 5 months ago

@ealsur we found the root cause, and it was related to Cosmos accounts with customer-managed keys (CMK) enabled (https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-setup-customer-managed-keys?tabs=azure-portal). The weird thing is that each Cosmos client that was added (only using change feed) was adding ~150 TCP connections. We validated this by removing the clients one by one and watching the diagnostics: NumberOfActiveClients\NumberOfClientsCreated decreased by 1 and the TCP connections decreased by ~150. Reading the documentation, the volume of connections should be 4 * number of physical partitions, which is 10 for each account, so we would expect ~40 TCP connections. How can we explain ~150 connections? Could it be related to CMK?

ealsur commented 5 months ago

The article linked says that 4 * physical partitions is the "stable" state.

Each established connection can serve a configurable number of concurrent operations. If the volume of concurrent operations exceeds this threshold, new connections will be opened to serve them, and it's possible that, for a physical partition, the number of open connections exceeds the steady-state number.

It really depends on the concurrency of requests. The article also says that you can control this with:

This behavior is expected for workloads that might have spikes in their operational volume. For the .NET SDK this configuration is set by CosmosClientOptions.MaxRequestsPerTcpConnection, and for the Java SDK you can customize using DirectConnectionConfig.setMaxRequestsPerConnection.
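For reference, a minimal sketch of how those options can be set on the .NET SDK (the values are illustrative, not recommendations; the endpoint and credential are placeholders):

```csharp
// Hedged sketch: tuning Direct mode connection behavior via CosmosClientOptions.
// The numeric values below are examples only, not tuned recommendations.
CosmosClientOptions options = new CosmosClientOptions
{
    ConnectionMode = ConnectionMode.Direct,
    // Allow more concurrent requests per TCP connection before a new one is opened
    MaxRequestsPerTcpConnection = 30,
    // Close connections that stay idle after a burst of concurrent requests
    IdleTcpConnectionTimeout = TimeSpan.FromMinutes(20)
};

CosmosClient client = new CosmosClient("https://<account>.documents.azure.com:443/", tokenCredential, options);
```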

I don't know about CMK; are you using the Encryption NuGet packages?

oshvartz commented 5 months ago

@ealsur we just want to verify that ~200 TCP connections make sense. The account uses CMK (this is supported by the Cosmos SDK by passing a TokenCredential) and it is only used for change feed, in processor mode, on one collection. The collection has one logical partition, but when I looked at the resource metrics I saw that it has 10 physical partitions. There is the lease container too (very small, but it also has 10 physical partitions - maybe this is the minimum). The change feed is just pulling, so concurrency is low IMO, so I get to ~80 TCP connections. Is there a way to understand the number of connections in change feed? I want to verify we are not doing something wrong using the SDK.

ealsur commented 5 months ago

Is the lease container using a different CosmosClient? Do you have any diagnostics that show the TCP connections stats?

oshvartz commented 5 months ago

@ealsur we just want to verify that ~200 TCP connections make sense. The account uses CMK (this is supported by the Cosmos SDK by passing a TokenCredential) and it is only used for change feed, in processor mode, on one collection. The collection has one logical partition, but when I looked at the resource metrics I saw that it has 10 physical partitions. There is the lease container too (very small, but it also has 10 physical partitions - maybe this is the minimum). The change feed is just pulling, so concurrency is low IMO. Is there a way to understand this? Other accounts take only ~40 connections for change feed in the same process.

oshvartz commented 5 months ago

@ealsur - after revisiting our code we noticed that we had set the max RU to 100K for all collections, including lease collections, and this is the reason for the 10 physical partitions - the collections are empty, but connections are still opened to the replicas in each partition. I also noticed we have several lease containers in the same process, and this explains how we reached this number of connections. Thanks for your help, I learned a lot, and we now have diagnostics in the change feed too, which will help us troubleshoot issues in the future. We are considering moving to Gateway mode to reduce the number of TCP connections, at the cost of some degradation in performance - do you have any document that would help us estimate the increase in latency? Closing this issue.
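For reference, a minimal sketch of what switching to Gateway mode looks like on the client (endpoint, credential, and the connection limit value are placeholders/assumptions):

```csharp
// Hedged sketch: Gateway mode routes requests over HTTPS through the gateway endpoint,
// avoiding the per-replica TCP connections of Direct mode.
CosmosClientOptions gatewayOptions = new CosmosClientOptions
{
    ConnectionMode = ConnectionMode.Gateway,
    // Caps the HttpClient connection pool used in Gateway mode (example value)
    GatewayModeMaxConnectionLimit = 64
};

CosmosClient gatewayClient = new CosmosClient("https://<account>.documents.azure.com:443/", tokenCredential, gatewayOptions);
```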

ealsur commented 5 months ago

No, there is no way that I'm aware of to estimate the latency impact, if any (assuming the machine is in the same Azure region).