Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License

Availability: Optimize EnableTcpConnectionEndpointRediscovery with nonblocking cache #3187

Open j82w opened 2 years ago

j82w commented 2 years ago

EnableTcpConnectionEndpointRediscovery causes an Address cache refresh when the TCP connection is closed.

https://github.com/Azure/azure-cosmos-dotnet-v3/blob/70b1b4a71216d4437229ae7a4f35b5a686c4950a/Microsoft.Azure.Cosmos/src/CosmosClientOptions.cs#L494

This uses the Direct ConnectionStateListener: https://msdata.visualstudio.com/_git/CosmosDB?path=/Product/Microsoft.Azure.Documents/SharedFiles/ConnectionStateListener.cs

The ConnectionStateListener calls UpdateAsync on the GlobalAddressResolver. The problem is that UpdateAsync does a TryRemove, which removes all the addresses from the cache rather than only the one whose connection closed. https://github.com/Azure/azure-cosmos-dotnet-v3/blob/70b1b4a71216d4437229ae7a4f35b5a686c4950a/Microsoft.Azure.Cosmos/src/Routing/GlobalAddressResolver.cs#L150

This is bad because it blocks all requests to that partition until the addresses are retrieved from the gateway, which can take multiple seconds if there are networking or gateway issues. If that gateway request fails, it has to be retried until it succeeds. Ideally, only the one address would be marked as unhealthy, and that would trigger the cache refresh. That way the other 3 replicas can continue to process requests.

This can be done by following a design similar to this one: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/70b1b4a71216d4437229ae7a4f35b5a686c4950a/Microsoft.Azure.Cosmos/src/Routing/GatewayAddressCache.cs#L199
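
To make the idea concrete, here is a minimal sketch of a nonblocking cache, not the SDK's actual implementation. `ReplicaAddressCache`, `MarkUnhealthy`, `GetHealthy`, and the `fetchFromGateway` delegate are hypothetical names used only for illustration. When one connection closes, only that address is dropped and a refresh starts, while readers keep getting the remaining replicas without waiting for the gateway.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical type for illustration only; not part of the Cosmos DB SDK.
public sealed class ReplicaAddressCache
{
    private readonly ConcurrentDictionary<string, IReadOnlyList<Uri>> healthy =
        new ConcurrentDictionary<string, IReadOnlyList<Uri>>();

    // Assumed stand-in for the call that asks the gateway for fresh addresses.
    private readonly Func<string, Task<IReadOnlyList<Uri>>> fetchFromGateway;

    public ReplicaAddressCache(Func<string, Task<IReadOnlyList<Uri>>> fetchFromGateway)
    {
        this.fetchFromGateway = fetchFromGateway;
    }

    public void Set(string partitionId, IReadOnlyList<Uri> addresses)
    {
        this.healthy[partitionId] = addresses;
    }

    // Readers never wait on a refresh; they see whatever is currently healthy.
    public IReadOnlyList<Uri> GetHealthy(string partitionId)
    {
        return this.healthy.TryGetValue(partitionId, out IReadOnlyList<Uri> list)
            ? list
            : Array.Empty<Uri>();
    }

    // Drops only the failed address, then starts a refresh.
    // Callers may ignore the returned task to keep the request path nonblocking.
    public Task MarkUnhealthy(string partitionId, Uri failed)
    {
        this.healthy.AddOrUpdate(
            partitionId,
            Array.Empty<Uri>(),
            (_, current) => current.Where(a => a != failed).ToList());
        return this.RefreshAsync(partitionId);
    }

    private async Task RefreshAsync(string partitionId)
    {
        try
        {
            this.healthy[partitionId] = await this.fetchFromGateway(partitionId);
        }
        catch (Exception)
        {
            // Gateway failure: keep serving the remaining replicas.
        }
    }
}
```

With four replicas cached, calling `MarkUnhealthy` for one of them leaves three addresses available from `GetHealthy` while the gateway call is still pending. A real implementation would also need to decide how to retry a failed refresh, which this sketch leaves out.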

j82w commented 2 years ago

One concern with this approach is it could cause throttling because of the number of refreshes that would occur at the same time.
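
One possible way to limit that risk, offered only as a sketch under assumptions and not as something the SDK does, is to coalesce duplicate refreshes per partition and cap how many gateway calls run at once. `RefreshCoalescer`, its `refresh` delegate, and `maxConcurrentRefreshes` are hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical type for illustration only; not part of the Cosmos DB SDK.
public sealed class RefreshCoalescer
{
    private readonly ConcurrentDictionary<string, Lazy<Task>> inFlight =
        new ConcurrentDictionary<string, Lazy<Task>>();
    private readonly SemaphoreSlim gatewaySlots;
    private readonly Func<string, Task> refresh;

    public RefreshCoalescer(Func<string, Task> refresh, int maxConcurrentRefreshes)
    {
        this.refresh = refresh;
        this.gatewaySlots = new SemaphoreSlim(maxConcurrentRefreshes);
    }

    // Concurrent callers for the same partition share one in-flight refresh.
    public Task RefreshAsync(string partitionId)
    {
        Lazy<Task> lazy = this.inFlight.GetOrAdd(
            partitionId,
            id => new Lazy<Task>(() => this.RunAsync(id)));
        return lazy.Value;
    }

    private async Task RunAsync(string partitionId)
    {
        // Cap the number of refreshes that hit the gateway at the same time.
        await this.gatewaySlots.WaitAsync();
        try
        {
            await this.refresh(partitionId);
        }
        finally
        {
            this.gatewaySlots.Release();
            // Allow a later connection event to start a new refresh.
            this.inFlight.TryRemove(partitionId, out _);
        }
    }
}
```

If several connections to the same partition close together, every caller receives the same task and the gateway sees a single request for that partition. Whether a cap like this is enough to avoid throttling would need to be measured.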