Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License
723 stars 477 forks source link

LoadBalancingPartition: Clean-Up Unhealthy Dangling Connections. #4467

Open kundadebdatta opened 1 month ago

kundadebdatta commented 1 month ago

Description

Scope: Applicable with and without the "Advanced Replica Selection" feature.

Problem: One or our recent conversations with the IC3 team has helped us unveiling a potential issue with the RNTBD connection creation and management flow in the LoadBalancingPartition. The team is using the .NET SDK version 3.39.0-preview to connect to cosmos db. Recently they encountered a potential memory leak in some of their clusters and upon investigation, it appeared that one of the root cause is the underlying CosmosClient is keeping a high number of unhealthy and unused LbChannelStates.

In a nut shell, below are few of the account configurations and facts:

account-1: 9 Partitions with 1 unique tenant. There are approx 4 to 8 clients for this tenant. 2 * no of replica regions is the client count. They have the connection warm-up enabled on this account.

account-2 2592 Partitions with 249 tenants/ feds. Connections created in happy path scenario: 249 x Y (Y = number of active clients for that account). Connection warm-up disabled on this account.

account-3: 27 Partitions with 13 tenants/ feds. CreateAndInitialize They have the connection warm-up enabled on this account.

To understand this in more detail, please take a look at the memory dump below.

ic3_memory_dump_accounts_hidden

[Fig-1: The above figure shows a snapshot of the memory dump taken for multiple accounts This also unviels the potential memory leak by the unhealthy connection.]

Upon further analysis of the memory dump, it is clear that:

Analysis : Upon further digging in to the memory dump, and re-producing the scenario locally, it was noted that:

Let's take a look at the below diagram to understand this in more detail:

image

[Fig-4: The above figure shows an instance of the LoadBalancingPartition holding more than one entry of unhealthy LbChannelState.]

By looking at the above memory dump snapshot, it is clear that these stale LbChannelState entries are kept in the LoadBalancingPartition, until they are removed the openChannels list, which is responsible for maintaining the number of channels (healthy or unhealthy) for that particular endpoint. If they are not cleaned up proactively (which is exactly this case), it might end up claiming extra memory overhead. With increasing number of partitions, connections over the time, things get worse with all these unused, yet low hanging LbChannelStates claiming more and more memory, and causing it to a memory leak. This is the potential root cause of the increased memory consumption.

Proposed Solution :

There are few changes proposed to fix this scenario. These are discussed briefly in the below section:

FAQs: