Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License

[BUG] RNTBD RntbdServiceEndpoint close exception failure. #14094

Closed: moderakh closed this issue 4 years ago

moderakh commented 4 years ago

The following was reported, and we verified that a singleton Cosmos client is involved:

2020-08-12T18:24:39.046706249Z java.lang.NullPointerException: null
2020-08-12T18:24:39.046712249Z at com.azure.cosmos.implementation.directconnectivity.rntbd.RntbdServiceEndpoint$Provider.access$100(RntbdServiceEndpoint.java:305) ~[azure-cosmos-4.3.0.jar!/:na]
2020-08-12T18:24:39.046718849Z at com.azure.cosmos.implementation.directconnectivity.rntbd.RntbdServiceEndpoint.close(RntbdServiceEndpoint.java:168) ~[azure-cosmos-4.3.0.jar!/:na]
2020-08-12T18:24:39.046725249Z at com.azure.cosmos.implementation.directconnectivity.rntbd.RntbdClientChannelPool.lambda$newTimeout$13(RntbdClientChannelPool.java:825) ~[azure-cosmos-4.3.0.jar!/:na]
2020-08-12T18:24:39.046731649Z at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672) ~[netty-common-4.1.50.Final.jar!/:4.1.50.Final]
2020-08-12T18:24:39.046737849Z at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747) [netty-common-4.1.50.Final.jar!/:4.1.50.Final]
2020-08-12T18:24:39.046744049Z at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472) [netty-common-4.1.50.Final.jar!/:4.1.50.Final]
2020-08-12T18:24:39.046750149Z at java.lang.Thread.run(Thread.java:748) [na:1.8.0_232]
2020-08-12T18:24:39.046755849Z
2020-08-12T18:24:39.831879224Z java.lang.IllegalStateException: RntbdServiceEndpoint({"id":1,"isClosed":true,"concurrentRequests":0,"remoteAddress":"cdb-ms-prod-centralus1-fd16.documents.azure.com:14454","channelPool":{"remoteAddress":"cdb-ms-prod-centralus1-fd16.documents.azure.com:14454","isClosed":false,"configuration":{"maxChannels":130,"maxRequestsPerChannel":30,"idleConnectionTimeout":0,"readDelayLimit":65000000000,"writeDelayLimit":10000000000},"state":{"channelsAcquired":0,"channelsAvailable":0,"requestQueueLength":0}}}) is closed
2020-08-12T18:24:39.831894524Z at com.azure.cosmos.implementation.guava25.base.Preconditions.checkState(Preconditions.java:586) ~[azure-cosmos-4.3.0.jar!/:na]
2020-08-12T18:24:39.831900024Z at com.azure.cosmos.implementation.directconnectivity.rntbd.RntbdServiceEndpoint.throwIfClosed(RntbdServiceEndpoint.java:222) ~[azure-cosmos-4.3.0.jar!/:na]
2020-08-12T18:24:39.831904624Z at com.azure.cosmos.implementation.directconnectivity.rntbd.RntbdServiceEndpoint.request(RntbdServiceEndpoint.java:175) ~[azure-cosmos-4.3.0.jar!/:na]
2020-08-12T18:24:39.831909224Z at com.azure.cosmos.implementation.directconnectivity.RntbdTransportClient.invokeStoreAsync(RntbdTransportClient.java:129) ~[azure-cosmos-4.3.0.jar!/:na]
2020-08-12T18:24:39.831913824Z at com.azure.cosmos.implementation.directconnectivity.TransportClient.invokeResourceOperationAsync(TransportClient.java:21) ~[azure-cosmos-4.3.0.jar!/:na]
2020-08-12T18:24:39.831919824Z at com.azure.cosmos.implementation.directconnectivity.ConsistencyWriter.lambda$writePrivateAsync$4(ConsistencyWriter.java:164) ~[azure-cosmos-4.3.0.jar!/:na]

When this happens, latency increases while the SDK retries multiple times, and the request eventually succeeds. It has also been reported that the SDK sometimes doesn't recover and the app needs to be restarted.
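
To illustrate the failure mode the two traces suggest (a timer thread closes an endpoint, and a later request is still routed to that closed endpoint), here is a minimal sketch of one possible guard. It is not the SDK's code or the actual fix: `EndpointProvider`, `Endpoint`, and all their methods are hypothetical stand-ins, and the sketch assumes the provider caches one endpoint per address. It shows a `close()` that is idempotent and tolerates a missing provider, and a lookup that never returns an endpoint that has already been closed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for an endpoint cache; NOT the SDK's RntbdServiceEndpoint.Provider.
final class EndpointProvider {
    private final ConcurrentHashMap<String, Endpoint> endpoints = new ConcurrentHashMap<>();

    // Return the cached endpoint, atomically replacing it if it was already closed.
    Endpoint get(String address) {
        return endpoints.compute(address, (key, existing) ->
            existing == null || existing.isClosed() ? new Endpoint(key, this) : existing);
    }

    // Remove the mapping only if it still points at this exact endpoint,
    // so a late eviction cannot drop a fresh replacement.
    void evict(Endpoint endpoint) {
        endpoints.remove(endpoint.address(), endpoint);
    }

    int size() {
        return endpoints.size();
    }
}

// Hypothetical stand-in for a service endpoint; NOT the SDK's RntbdServiceEndpoint.
final class Endpoint {
    private final String address;
    private final EndpointProvider provider; // assumption: may be null in this sketch
    private final AtomicBoolean closed = new AtomicBoolean(false);

    Endpoint(String address, EndpointProvider provider) {
        this.address = address;
        this.provider = provider;
    }

    String address() {
        return address;
    }

    boolean isClosed() {
        return closed.get();
    }

    // Idempotent close: only the first caller evicts, and a null provider is tolerated.
    void close() {
        if (closed.compareAndSet(false, true) && provider != null) {
            provider.evict(this);
        }
    }

    // Mirrors the "is closed" check seen in the second trace.
    String request(String payload) {
        if (closed.get()) {
            throw new IllegalStateException("Endpoint(" + address + ") is closed");
        }
        return "sent " + payload + " to " + address;
    }
}
```

With this sketch, calling `provider.get(address)` after an idle-timeout `close()` yields a fresh endpoint instead of one whose `request` throws `IllegalStateException`. The conditional `remove(key, value)` is what keeps a late eviction from discarding the replacement.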

https://portal.microsofticm.com/imp/v3/incidents/details/199732860/home

kushagraThapar commented 4 years ago

This has been fixed in v4.3.2-beta.1