Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API

CosmosDB v3 SDK spawning infinite number of threads, causing threadpool starvation issue. #1852

Closed ericsuhong closed 2 years ago

ericsuhong commented 3 years ago

We are continuously addressing and improving the SDK; if possible, please make sure the problem persists in the latest SDK version.

Describe the bug From time to time, we have discovered that a few of our service instances become unresponsive and do not process any more requests.

From further investigation, we found that the threadpool queue length starts to grow without bound on the affected instance: (attached image)

We took a dotnet-trace dump and found that the CosmosDB v3 SDK is spawning an unbounded number of threads behind the scenes, causing a threadpool starvation issue: (attached image)
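For anyone trying to confirm the same symptom, here is a minimal probe (my own illustrative sketch, not from the report; the ThreadPoolProbe name is made up) that logs the runtime's threadpool counters on .NET Core 3.0+. A PendingWorkItemCount that keeps climbing while ThreadCount also grows is the signature of this kind of starvation:

using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal threadpool starvation probe (illustrative sketch, not part of the SDK).
// A steadily growing PendingWorkItemCount while ThreadCount keeps climbing
// matches the behavior described in this issue.
public static class ThreadPoolProbe
{
    public static async Task RunAsync(CancellationToken cancellationToken)
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            Console.WriteLine(
                $"{DateTime.UtcNow:O} threads={ThreadPool.ThreadCount} " +
                $"queued={ThreadPool.PendingWorkItemCount} " +
                $"completed={ThreadPool.CompletedWorkItemCount}");

            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
        }
    }
}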

It seems like the Microsoft.Azure.Documents.Rntbd.Dispatcher.OnIdleTimer method starts to spawn an unbounded number of threads under some race condition.
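The exact race inside the SDK is not visible from the trace, but the general failure pattern is easy to illustrate. The sketch below is hypothetical and is not the SDK's code: a timer callback that queues one blocking work item per tracked connection can outrun the threadpool's slow thread injection, so both the queue length and the thread count grow.

using System;
using System.Threading;

// Hypothetical illustration of the starvation pattern, NOT the SDK's code.
// Each timer tick queues one blocking work item per "connection"; the pool
// injects new threads far more slowly than work arrives, so the queue grows
// while the runtime keeps adding threads.
public static class IdleTimerStarvationDemo
{
    public static void Main()
    {
        const int connectionCount = 5000;

        var timer = new Timer(_ =>
        {
            for (int i = 0; i < connectionCount; i++)
            {
                // Blocking inside a pool work item is what turns a burst of
                // connection closes into threadpool starvation.
                ThreadPool.QueueUserWorkItem(__ => Thread.Sleep(TimeSpan.FromSeconds(30)));
            }
        }, null, dueTime: 0, period: 1000);

        Console.ReadLine();
        GC.KeepAlive(timer);
    }
}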

To Reproduce This issue doesn't always occur, so it is difficult to pin down exactly when it happens. However, I attached a sample trace file which can be opened with PerfView: cosmosdbsdk-trace.zip

Expected behavior The CosmosDB SDK should not spawn an unbounded number of threads and cause a threadpool starvation problem.

Actual behavior The CosmosDB SDK intermittently spawns an unbounded number of threads and causes services to become unresponsive.

Environment summary SDK Version: 3.12.0; OS Version: Linux (Ubuntu 16.04), running .NET Core 3.1 in Kubernetes

Additional context Our CosmosClientOptions:

private CosmosClientOptions ConnectionPolicyFromClientSettings => new CosmosClientOptions
{
    RequestTimeout = TimeSpan.FromSeconds(60),
    MaxRetryAttemptsOnRateLimitedRequests = 1,
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(60),
    ConnectionMode = ConnectionMode.Direct,
    PortReuseMode = PortReuseMode.PrivatePortPool,
    IdleTcpConnectionTimeout = TimeSpan.FromMinutes(20),
    HttpClientFactory = this._httpClientFactory
};

We also attached the dotnet trace file.
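For readers reproducing this setup, here is a minimal sketch of how options like the ones above are typically paired with a single long-lived CosmosClient and a shared HttpClient; the class name, method name, and credentials below are placeholders, not from the original report:

using System;
using System.Net.Http;
using Microsoft.Azure.Cosmos;

// Minimal sketch: one long-lived CosmosClient built from options like those
// above. The SDK invokes HttpClientFactory when it needs an HttpClient (for
// gateway/metadata traffic even in Direct mode); returning a shared instance
// avoids per-client socket churn. Endpoint and key are placeholders.
public sealed class CosmosDependencies
{
    private static readonly HttpClient SharedHttpClient = new HttpClient();

    public static CosmosClient CreateClient(string endpoint, string authKey, CosmosClientOptions options)
    {
        options.HttpClientFactory = () => SharedHttpClient;

        // Intended to be cached and reused for the lifetime of the process.
        return new CosmosClient(endpoint, authKey, options);
    }
}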

ericsuhong commented 3 years ago

This issue seems to repro 100% of the time when there is a sudden stop in incoming traffic. I am pretty sure that this issue is being caused by IdleTcpConnectionTimeout, because the issue happens exactly 20 minutes after traffic stops: (attached image)

However, we have also seen a case where it recovers automatically exactly 20 minutes later as well: (attached image)
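Since the stall lines up exactly with the 20-minute idle timeout, one way to test that hypothesis is to leave IdleTcpConnectionTimeout unset (idle connections are then kept open by default) and check whether the hang still follows a traffic stop. A sketch of that experiment, assuming everything else stays as posted above (the property name ExperimentalConnectionPolicy is hypothetical):

// Experiment sketch (assumption: only the idle timeout changes from the
// options posted above). Leaving IdleTcpConnectionTimeout unset keeps idle
// connections open indefinitely (the SDK default), so a sudden traffic stop
// should no longer trigger a mass connection close at the 20-minute mark.
private CosmosClientOptions ExperimentalConnectionPolicy => new CosmosClientOptions
{
    RequestTimeout = TimeSpan.FromSeconds(60),
    MaxRetryAttemptsOnRateLimitedRequests = 1,
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(60),
    ConnectionMode = ConnectionMode.Direct,
    PortReuseMode = PortReuseMode.PrivatePortPool,
    // IdleTcpConnectionTimeout intentionally left at its default.
    HttpClientFactory = this._httpClientFactory
};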

j82w commented 3 years ago

@ericsuhong how large are the databases and/or containers the client is connecting to? I want to figure out how many TCP connections the client has open.

ericsuhong commented 3 years ago

This is a very large database with ~10000 physical partitions. All our queries are cross-partition queries and our traffic is very large, so I can only assume that it maintains a huge number of TCP connections at any time.

TimPosey2 commented 3 years ago

All our queries are cross-partition queries

Is it possible to re-model your data (even if it requires you to duplicate it) to reduce cross-partition querying? See the sketch below for what partition-scoped queries look like.
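As a hedged illustration of what such re-modeling buys (the container shape, the tenantId partition key, and the query below are all hypothetical): once a query can be scoped to a single partition key value, the SDK sends it to one physical partition instead of fanning out across all ~10000.

using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical example: scoping a query to one partition key value so it is
// served by a single physical partition instead of fanning out to all of them.
public static class OrderQueries
{
    public static async Task<int> CountOpenOrdersForTenantAsync(Container container, string tenantId)
    {
        var query = new QueryDefinition("SELECT VALUE COUNT(1) FROM c WHERE c.status = @status")
            .WithParameter("@status", "open");

        var options = new QueryRequestOptions
        {
            // Restricting the query to one partition key value avoids the
            // cross-partition fan-out discussed above.
            PartitionKey = new PartitionKey(tenantId)
        };

        using FeedIterator<int> iterator = container.GetItemQueryIterator<int>(query, requestOptions: options);

        int total = 0;
        while (iterator.HasMoreResults)
        {
            foreach (int count in await iterator.ReadNextAsync())
            {
                total += count;
            }
        }

        return total;
    }
}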

ghost commented 2 years ago

Closing due to inactivity, please feel free to re-open.