Scope: Applicable with and without the "Advanced Replica Selection" feature.
Problem: One of our recent conversations with the IC3 team helped us unveil a potential issue with the RNTBD connection creation and management flow in the LoadBalancingPartition. The team is using .NET SDK version 3.39.0-preview to connect to Cosmos DB. They recently encountered a potential memory leak in some of their clusters, and upon investigation it appeared that one of the root causes is that the underlying CosmosClient keeps a high number of unhealthy and unused LbChannelStates.
In a nutshell, below are a few of the account configurations and facts:
• account-1: 9 Partitions with 1 unique tenant. There are approximately 4 to 8 clients for this tenant (the client count is 2 * the number of replica regions). Connection warm-up is enabled on this account.
• account-2: 2592 Partitions with 249 tenants/feds. Connections created in the happy-path scenario: 249 x Y (Y = number of active clients for that account). Connection warm-up is disabled on this account.
• account-3: 27 Partitions with 13 tenants/feds. Connection warm-up (via CreateAndInitialize) is enabled on this account.
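The happy-path connection math above can be sketched in a few lines (illustrative only; the value of Y is an assumption here, not from the memory dump):

```python
# Hypothetical sketch of the happy-path connection counts described above.
# Y (active clients per account) is an assumed example value.

def happy_path_connections(tenants: int, active_clients: int) -> int:
    """Connections created in the happy path: one per tenant/fed per active client."""
    return tenants * active_clients

def client_count(replica_regions: int) -> int:
    """account-1 rule of thumb: client count = 2 * number of replica regions."""
    return 2 * replica_regions

# account-2: 249 tenants/feds; with, say, Y = 4 active clients:
print(happy_path_connections(249, 4))  # 996

# account-1: e.g. 3 replica regions -> 6 clients (within the observed 4-8 range)
print(client_count(3))  # 6
```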
To understand this in more detail, please take a look at the memory dump below.
[Fig-1: The above figure shows a snapshot of the memory dump taken for multiple accounts. It also unveils the potential memory leak caused by the unhealthy connections.]
Upon further analysis of the memory dump, it is clear that:
The number of stale unhealthy connections is higher in the accounts where replica validation is enabled along with the connection warm-up.
Without the connection warm-up, the number of stale unhealthy connections is comparatively lower, but still large enough to increase the memory footprint.
[Fig-2: The above figure shows how the memory footprint increased over time, along with incoming requests. The service eventually had to be restarted to free up the memory.]
Even without the replica validation feature, the memory footprint showed a consistent increase over time.
[Fig-3: The above figure shows the memory consumption from the IC3 partner-api service, which is using an older version (v3.25.0) of the .NET SDK; its memory consumption kept increasing with time.]
Analysis: Upon digging further into the memory dump and reproducing the scenario locally, it was noted that:
With Replica Validation Enabled: Each of the impacted LoadBalancingPartitions was holding more than 1 unhealthy stale LbChannelState (a wrapper around the Dispatcher and a Channel) when the connection to the backend replica was closed deterministically.
With Replica Validation Disabled: Each of the impacted LoadBalancingPartitions was holding exactly 1 unhealthy stale LbChannelState (a wrapper around the Dispatcher and a Channel) when the connection to the backend replica was closed deterministically.
Let's take a look at the below diagram to understand this in more detail:
[Fig-4: The above figure shows an instance of the LoadBalancingPartition holding more than one entry of unhealthy LbChannelState.]
Looking at the above memory dump snapshot, it is clear that these stale LbChannelState entries are kept in the LoadBalancingPartition until they are removed from the openChannels list, which is responsible for maintaining the number of channels (healthy or unhealthy) for that particular endpoint. If they are not cleaned up proactively (which is exactly the case here), they end up claiming extra memory overhead. As the number of partitions and connections grows over time, things get worse, with all these unused, lingering LbChannelStates claiming more and more memory and effectively causing a memory leak. This is the potential root cause of the increased memory consumption.
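The accumulation can be illustrated with a toy model. The names below mirror the SDK's LoadBalancingPartition, openChannels, and LbChannelState, but this is a deliberately simplified sketch, not the actual implementation:

```python
from enum import Enum

class State(Enum):
    HEALTHY = "Healthy"
    UNHEALTHY = "Unhealthy"

class LbChannelState:
    """Toy stand-in for the SDK's LbChannelState (wraps a Dispatcher and a Channel)."""
    def __init__(self):
        self.state = State.HEALTHY

    def close(self):
        # The backend closing the connection marks the channel Unhealthy,
        # but nothing removes it from the partition's openChannels list.
        self.state = State.UNHEALTHY

class LoadBalancingPartition:
    """Toy partition: tracks every channel (healthy or unhealthy) to an endpoint."""
    def __init__(self):
        self.open_channels = []

    def open_channel(self):
        channel = LbChannelState()
        self.open_channels.append(channel)
        return channel

# Each validation cycle opens a new channel; the backend later closes it.
partition = LoadBalancingPartition()
for _ in range(5):
    partition.open_channel().close()

# All five stale Unhealthy entries are still retained -> growing footprint.
stale = [c for c in partition.open_channels if c.state is State.UNHEALTHY]
print(len(stale))  # 5
```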
Proposed Solution:
A few changes are proposed to fix this scenario. These are discussed briefly in the section below:
During the replica validation phase, in OpenConnectionAsync(), proactively remove all the Unhealthy connections from the openChannels list within the LoadBalancingPartition. This guarantees that any unhealthy LbChannelStates will be removed from the LoadBalancingPartition, freeing up the additional memory.
Yet to be identified: figure out ways to avoid opening duplicate connections to the same endpoint multiple times.
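The proposed cleanup can be sketched as follows (a self-contained toy model; the real change would live inside the SDK's OpenConnectionAsync(), which this Python sketch only approximates):

```python
from enum import Enum

class State(Enum):
    HEALTHY = "Healthy"
    UNHEALTHY = "Unhealthy"

class LbChannelState:
    """Toy stand-in for the SDK's LbChannelState."""
    def __init__(self, state=State.HEALTHY):
        self.state = state

class LoadBalancingPartition:
    def __init__(self):
        self.open_channels = []

    def open_connection(self):
        """Sketch of the proposed change: purge Unhealthy entries
        before opening the validation connection."""
        self.open_channels = [
            c for c in self.open_channels if c.state is State.HEALTHY
        ]
        channel = LbChannelState()
        self.open_channels.append(channel)
        return channel

# Partition polluted with four stale Unhealthy channels plus one healthy one.
partition = LoadBalancingPartition()
partition.open_channels = [LbChannelState(State.UNHEALTHY) for _ in range(4)]
partition.open_channels.append(LbChannelState(State.HEALTHY))

partition.open_connection()

# Only the pre-existing healthy channel and the newly opened one remain.
print(len(partition.open_channels))  # 2
```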
FAQs:
Is this applicable only to the newer versions of the SDK?
Ans: No, the scenario can happen with older versions of the SDK too. As discussed in the sections above, the root cause of this problem lies in connection management and clean-up. It is observed from the memory dump that with older versions of the SDK, each impacted LoadBalancingPartition holds exactly one stale connection. Thus, with an increasing number of connections, memory utilization can grow due to these unused stale connections staying in memory.
Does enabling "Advanced Replica Selection" make the memory consumption worse?
Ans: Yes, it does. The advanced replica selection feature is designed to keep track of any Unhealthy replica that had a connectivity issue and temporarily quarantine it, so that incoming requests have a higher chance of landing on a Healthy replica. Additionally, the feature validates the Unhealthy replica by proactively opening a connection to check whether the replica came back up. This dramatically reduces the latency for read workloads when a replica undergoes upgrades, etc. However, these proactively opened connections can potentially increase the number of stale connections when the connection is closed by the backend (BE) for idleness. This has a larger impact on increasing the memory footprint today.
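For intuition, the quarantine-and-prefer-healthy behavior described above can be sketched as follows (illustrative only; the quarantine window and data structures are assumptions, and the SDK's actual replica selection logic is more involved):

```python
import time

class ReplicaSelector:
    """Toy sketch of advanced replica selection: quarantine Unhealthy
    replicas for a fixed window and prefer Healthy ones for routing."""
    QUARANTINE_SECONDS = 60.0  # assumed window, not the SDK's actual value

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.quarantined = {}  # replica -> time the failure was reported

    def report_failure(self, replica):
        # A connectivity issue quarantines the replica temporarily.
        self.quarantined[replica] = time.monotonic()

    def candidates(self):
        now = time.monotonic()
        healthy = [
            r for r in self.replicas
            if r not in self.quarantined
            or now - self.quarantined[r] >= self.QUARANTINE_SECONDS
        ]
        # Fall back to all replicas if everything is quarantined.
        return healthy or self.replicas

selector = ReplicaSelector(["replica-1", "replica-2", "replica-3"])
selector.report_failure("replica-2")
print(selector.candidates())  # ['replica-1', 'replica-3']
```

Note that the validation step (proactively opening a connection to a quarantined replica) is exactly what creates the extra channels that, without the proposed cleanup, linger as stale LbChannelStates.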