Open kushagraThapar opened 4 years ago
Hi, we are experiencing similar issues in a production environment. Not sure if this is the same, so if you think I should open a separate issue, I can do that.
We are using direct connection mode and have no data replication (only one region). In addition to the store response stats, we see a lot of these in the logs:
...
io.netty.handler.timeout.ReadTimeoutException: null
[ERROR] c.m.a.c.i.d.GatewayAddressCache - Network failure
com.microsoft.azure.cosmosdb.DocumentClientException: null
c.m.a.c.r.i.ClientRetryPolicy - Gateway endpoint not reachable. Will refresh cache and retry.
The problem is that, even though the client eventually recovers, each occurrence causes hiccups of more than a minute, and our users notice them.
The weird part is that we have the same setup in some staging/test environments, and none of these errors show up there. One guess is that the load on the application is heavier in production, which could lead to some timeouts. But as far as I can see, it is not caused by a CPU resource shortage. That in turn could point to memory pressure or some other resource contention causing a lot of CPU stalls. I have not had time to look closer.
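One way to check the memory-pressure guess would be to sample GC time and heap usage from inside the application. Below is a minimal sketch using only the standard `java.lang.management` API. The class name `GcPressureProbe` and the one-second sampling window are made up for illustration and are not part of the SDK or of our application.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not part of any SDK): samples JVM-wide GC time and heap usage.
final class GcPressureProbe {

    // Accumulated GC time in milliseconds, summed over all collectors.
    // Collectors that report -1 (undefined) are skipped.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Fraction of the heap currently in use, relative to the maximum heap size
    // (or to the committed size if no maximum is defined).
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 if undefined
        long denominator = max > 0 ? max : heap.getCommitted();
        return denominator > 0 ? (double) heap.getUsed() / denominator : 0.0;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcTimeMillis();
        Thread.sleep(1000); // assumed sampling window of one second
        long after = totalGcTimeMillis();
        System.out.printf("GC time in window: %d ms, heap used: %.1f%%%n",
                after - before, heapUsedFraction() * 100.0);
    }
}
```

If the GC time within a window approaches the window length around the time the `ReadTimeoutException` entries appear, that would support the memory-pressure theory. If it stays low, some other kind of contention is more likely.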
As a side note, we are still using the V2 Java Async SDK, because the last time we were about to move to V4, some of the features we need were not yet implemented. We are currently investigating whether we can make the transition now, and whether it would fix the issues we are having.
Here is an excerpt from the production application logs: