storePhysicalAddress is null - V2 Java Async SDK

kushagraThapar commented 4 years ago

storePhysicalAddress is null - V2 Java Async SDK
Example diagnostics from one of the customers:

RequestStartTime: "24 Aug 2020 13:09:05.284", RequestEndTime: "24 Aug 2020 13:09:06.299", Duration: 1015 ms, Number of regions attempted: 1
StoreResponseStatistics{requestResponseTime="24 Aug 2020 13:09:06.286", storeResult=storePhysicalAddress: null, lsn: -1, globalCommittedLsn: -1, partitionKeyRangeId: null, isValid: false, statusCode: 410, subStatusCode: 0, isGone: true, isNotFound: false, isInvalidPartition: false, requestCharge: 0.0, itemLSN: -1, sessionToken: null, exception: failed to establish connection to cdb-ms-prod-westus1-fd4.documents.azure.com:14001: connection timed out: cdb-ms-prod-westus1-fd4.documents.azure.com/40.112.249.60:14001, requestResourceType=Document, requestOperationType=Read}
StoreResponseStatistics{requestResponseTime="24 Aug 2020 13:09:06.299", storeResult=storePhysicalAddress: rntbd://cdb-ms-prod-westus1-fd4.documents.azure.com:14350/apps/28ad6635-acc0-4a33-8cbf-513f2a7ecff0/services/44c8c0ce-5154-49ca-918a-3160e5be949d/partitions/1316e464-6646-4890-8e75-1f564230ce92/replicas/132401551806471291p/, lsn: 3024626, globalCommittedLsn: 3438427, partitionKeyRangeId: 251, isValid: true, statusCode: 200, subStatusCode: 0, isGone: false, isNotFound: false, isInvalidPartition: false, requestCharge: 1.14, itemLSN: 644636, sessionToken: 18#3438428#5=-1#7=1595052, exception: null, requestResourceType=Document, requestOperationType=Read}
AddressResolutionStatistics{startTime="24 Aug 2020 13:09:06.299", endTime="31 Dec +999999999 23:59:59.999", targetEndpoint='https://prod-membership-core-westus.documents.azure.com:443/addresses/?$resolveFor=dbs%2FLs1aAA%3D%3D%2Fcolls%2FLs1aAJPsrkg%3D%2Fdocs&$filter=protocol%20eq%20rntbd&$partitionKeyRangeIds=251'}

henrikp-visma commented 3 years ago

Hi, we are experiencing similar issues in a production environment. Not sure if this is the same, so if you think I should open a separate issue, I can do that.

For the connection we are using the direct connection mode and have no replication for the data (using only one region). In addition to the store response stats, we have a lot of these in the logs:

...
io.netty.handler.timeout.ReadTimeoutException: null 
[ERROR] c.m.a.c.i.d.GatewayAddressCache - Network failure 
com.microsoft.azure.cosmosdb.DocumentClientException: null

c.m.a.c.r.i.ClientRetryPolicy - Gateway endpoint not reachable. Will refresh cache and retry.

The issue here is that, even if the client eventually recovers, once we hit this issue, it causes > 1 minute hiccups. This is obviously experienced by the users.

The weird part is, we have the same setup in some staging/test environments, and none of these errors are manifesting there. One guess is that the load on the application could be heavier for the production, so some timeouts would result. But as far as I can see it seems it is not caused be CPU resource shortage. This on the other hand could imply memory pressure or some other resource contention causing a lot of CPU stalls. Have not had time to look closer.

-- As a side note, we are still using the V2 Java Async SDK, as last time we were about to transition to using V4 some of the necessary features were not yet implemented. Currently investigating if we could do the transition now, and could it fix the issues we are having.

henrikp-visma commented 3 years ago

Here is and excerpt from the production application logs:

2020-11-26-log-excerpt.txt

Azure / azure-cosmosdb-java

storePhysicalAddress is null - V2 Java Async SDK #353