Closed ericsuhong closed 2 years ago
This issue seems to reproduce 100% of the time when incoming traffic stops suddenly. I am fairly sure it is caused by IdleTcpConnectionTimeout, because the issue appears exactly 20 minutes after traffic stops:
However, we have also seen a case where it recovered automatically after exactly 20 minutes:
@ericsuhong how large are the databases and/or containers the client is connecting to? I want to figure out how many TCP connections the client has open.
This is a very large database with ~10000 physical partitions. All our queries are cross-partition queries and our traffic is very large, so I can only assume that it maintains a huge number of TCP connections at any time.
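For readers following along, the connection settings discussed here are set on `CosmosClientOptions` when the client is created. The following is only a minimal sketch: the endpoint and key are placeholders, and the 20-minute value is chosen because it matches the interval observed above, not because it is the reporter's actual configuration (which is not shown in this thread).

```csharp
using System;
using Microsoft.Azure.Cosmos;

public static class ClientFactory
{
    // Placeholder endpoint and key; hypothetical values for illustration only.
    public static CosmosClient Create()
    {
        var options = new CosmosClientOptions
        {
            // The idle timeout only applies to Direct (TCP) connectivity.
            ConnectionMode = ConnectionMode.Direct,
            // Connections unused for this long are closed by the SDK.
            // The SDK requires a value of at least 10 minutes.
            IdleTcpConnectionTimeout = TimeSpan.FromMinutes(20),
        };

        return new CosmosClient(
            "https://<account>.documents.azure.com:443/",
            "<key>",
            options);
    }
}
```

Leaving `IdleTcpConnectionTimeout` unset keeps the SDK default, in which idle connections are not closed on a timer; comparing behavior with and without the setting may help confirm whether the idle timer is involved.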
> All our queries are cross-partition queries
Is it possible to re-model (even if it requires you to duplicate) your data to reduce cross-partition querying?
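To illustrate what the suggestion above would buy, here is a hedged sketch of a query scoped to a single partition key. The names `tenantId`, `c.status`, and the helper class are hypothetical; they stand in for whatever partition key and filter a re-modeled container would use.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class SinglePartitionQuery
{
    // Hypothetical helper: builds options that pin a query to one logical partition.
    public static QueryRequestOptions BuildOptions(string tenantId)
        => new QueryRequestOptions { PartitionKey = new PartitionKey(tenantId) };

    // Hypothetical query; the property name "status" is an assumption for illustration.
    public static async Task<List<T>> QueryByTenantAsync<T>(Container container, string tenantId)
    {
        var query = new QueryDefinition("SELECT * FROM c WHERE c.status = @status")
            .WithParameter("@status", "active");

        // With PartitionKey set, the SDK routes the query to a single partition
        // instead of fanning out across every physical partition.
        FeedIterator<T> iterator = container.GetItemQueryIterator<T>(
            query, requestOptions: BuildOptions(tenantId));

        var results = new List<T>();
        while (iterator.HasMoreResults)
        {
            FeedResponse<T> page = await iterator.ReadNextAsync();
            results.AddRange(page);
        }
        return results;
    }
}
```

A query routed this way touches only the replicas of one partition, which should reduce the number of TCP connections the client needs compared with a fan-out across ~10000 partitions.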
Closing due to inactivity, please feel free to re-open.
Describe the bug From time to time, we have found that a few of our service instances become unresponsive and do not process any more requests.
On further investigation, we found that the thread pool queue length grows without bound on the affected instance: ![image](https://user-images.githubusercontent.com/3857851/93275113-ddee4800-f770-11ea-9d6b-9b2ed7500a5f.png)
We took a dotnet trace dump and found that the CosmosDB v3 SDK is spawning an unbounded number of threads behind the scenes, causing thread pool starvation: ![image](https://user-images.githubusercontent.com/3857851/93275086-cca53b80-f770-11ea-91f6-64efad3aae1e.png)
It seems that the Microsoft.Azure.Documents.Rntbd.Dispatcher.OnIdleTimer method starts to spawn an unbounded number of threads under some race condition.
To Reproduce This issue does not always occur, so it is difficult to determine exactly when it happens. However, I attached a sample trace file that can be opened with PerfView. cosmosdbsdk-trace.zip
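For anyone trying to catch this condition on a live instance, the following is a minimal sketch of sampling the thread pool counters that back the graphs above. It assumes .NET Core 3.0 or later, where `ThreadPool.ThreadCount` and `ThreadPool.PendingWorkItemCount` are available; the class name, sample count, and interval are arbitrary choices for illustration.

```csharp
using System;
using System.Threading;

public static class ThreadPoolSampler
{
    // Snapshot of the two counters most relevant to starvation diagnosis.
    public static (int ThreadCount, long PendingWorkItems) Sample()
        => (ThreadPool.ThreadCount, ThreadPool.PendingWorkItemCount);

    public static void Main()
    {
        // Print a few samples one second apart. In a real service this would
        // run on a dedicated thread (not the thread pool) and feed a metrics system.
        for (int i = 0; i < 3; i++)
        {
            var (threads, pending) = Sample();
            Console.WriteLine($"{DateTime.UtcNow:O} threads={threads} pending={pending}");
            Thread.Sleep(TimeSpan.FromSeconds(1));
        }
    }
}
```

A pending count that keeps rising while the thread count also climbs is consistent with the starvation pattern described here, though it does not by itself identify which component is queuing the work.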
Expected behavior The CosmosDB SDK should not spawn an unbounded number of threads and cause thread pool starvation.
Actual behavior The CosmosDB SDK intermittently spawns an unbounded number of threads, causing services to become unresponsive.
Environment summary SDK Version: 3.12.0 OS Version: Linux (Ubuntu 16.04), running .NET Core 3.1 in Kubernetes
Additional context Our CosmosClientOptions:
We also attached a dotnet trace file.