Open zhang-hua opened 6 years ago
This issue can be fixed by adding some retry logic to the getDatabaseAccountFromEndpoint() method of com/microsoft/azure/documentdb/DocumentClient.java:
    DatabaseAccount getDatabaseAccountFromEndpoint(URI endpoint) throws DocumentClientException {
        DocumentServiceRequest request = DocumentServiceRequest.create(OperationType.Read, ResourceType.DatabaseAccount, "", null);
        this.putMoreContentIntoDocumentServiceRequest(request, HttpConstants.HttpMethods.GET);
        DocumentServiceResponse response = null;
        // With the pre-decrement below, the loop body runs at most 31 times.
        int retryCount = 32;
        while (--retryCount > 0) {
            try {
                request.setEndpointOverride(endpoint);
                response = this.gatewayProxy.doRead(request);
                break;
            } catch (IllegalStateException e) {
                // Ignore all errors. Discover is an optimization.
                String message = "Failed to retrieve database account information. %s";
                Throwable cause = e.getCause();
                if (cause != null) {
                    message = String.format(message, cause.toString());
                } else {
                    message = String.format(message, e.toString());
                }
                logger.warn(message);
                // Retry only when the failure was caused by an UnknownHostException.
                if (cause instanceof UnknownHostException) {
                    try {
                        Thread.sleep(5000);
                    } catch (InterruptedException e1) {
                        // Restore the interrupt flag so callers can see the interruption.
                        Thread.currentThread().interrupt();
                        break;
                    }
                } else {
                    break;
                }
            }
        }
        // A null result means the account information could not be retrieved.
        if (response != null) {
            return response.getResource(DatabaseAccount.class);
        } else {
            return null;
        }
    }
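For anyone who wants to try the same idea outside the SDK, here is a minimal sketch of the retry pattern from the patch above as a standalone helper. It uses only the JDK and is not part of the DocumentDB SDK: the class name `UnknownHostRetry` and its methods are hypothetical, and the attempt count and delay are parameters chosen by the caller. One difference from the patch is that it walks the whole cause chain instead of checking only `e.getCause()`, on the assumption that the UnknownHostException may be wrapped more than once.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical helper, not an SDK class: retries an action only while the
// failure is (directly or indirectly) caused by an UnknownHostException.
final class UnknownHostRetry {
    private UnknownHostRetry() {}

    /** Returns true if t or any exception in its cause chain is an UnknownHostException. */
    static boolean causedByUnknownHost(Throwable t) {
        int depth = 0;
        while (t != null && depth++ < 32) { // depth limit guards against cyclic cause chains
            if (t instanceof UnknownHostException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    /**
     * Calls the action up to maxAttempts times, sleeping delayMillis between
     * attempts. Any failure not caused by an UnknownHostException, and the
     * failure of the last attempt, is rethrown unchanged.
     */
    static <T> T call(Callable<T> action, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts || !causedByUnknownHost(e)) {
                    throw e;
                }
                Thread.sleep(delayMillis); // an InterruptedException here propagates to the caller
            }
        }
    }
}
```

A caller would wrap whatever operation fails during startup, for example `UnknownHostRetry.call(() -> initDatabase(), 30, 5000L)`, where `initDatabase` is a placeholder for the application's own code. Rethrowing the last failure, rather than returning null as the patch does, keeps the original error visible to the caller.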
@srinathnarayanan can you please take a look at this one?
Hi @zhang-hua, I couldn't repro this issue. Are you still running into it? Is this in direct mode or gateway mode?
@srinathnarayanan, this only happens in the first few minutes after the DocumentDB account is deployed. I also tried switching between direct and gateway mode, but that doesn't help. Before the NullPointerException is thrown there is an UnknownHostException, which indicates that the route to the DocumentDB service is not reachable for a while at the beginning. The DocumentDB client in the Java SDK doesn't have any underlying retry logic and fails very quickly with a NullPointerException caused by this UnknownHostException. It happens very often in our IOT PCS2 solution because we implemented a microservice that initializes the DocumentDB database at boot, right after the account has been successfully deployed by ARM. Please feel free to ping me if you need more details.
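As a possible application-side stopgap for the startup scenario described above, a service could wait until the account's host name resolves before it creates its first client. The following is only a sketch using the JDK: `EndpointDnsWait` and `waitUntilResolvable` are hypothetical names, and it assumes the failure really is a DNS resolution problem, which this thread suggests but does not prove. Note also that the JVM caches failed lookups for a short time (the `networkaddress.cache.negative.ttl` security property, 10 seconds by default), so a delay much shorter than that may just re-read the cached failure.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

// Hypothetical helper, not an SDK class: polls DNS for the endpoint's host
// so that client creation can be deferred until the name resolves.
final class EndpointDnsWait {
    private EndpointDnsWait() {}

    /**
     * Tries to resolve the endpoint's host up to maxAttempts times, sleeping
     * delayMillis between failed attempts. Returns true as soon as the host
     * resolves, or false if it never does.
     */
    static boolean waitUntilResolvable(URI endpoint, int maxAttempts, long delayMillis)
            throws InterruptedException {
        String host = endpoint.getHost();
        if (host == null) {
            throw new IllegalArgumentException("endpoint has no host: " + endpoint);
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                InetAddress.getByName(host); // throws UnknownHostException if unresolved
                return true;
            } catch (UnknownHostException e) {
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        return false;
    }
}
```

The service would call this once at boot with its account endpoint URI and only construct the DocumentClient when it returns true. This does not replace retries inside the SDK; it only avoids creating the client while the name is still unresolvable.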
This issue has been hurting us on our production server for a long time now, and it is very frequent. It happens even when we put the address of the Cosmos DB endpoint in /etc/hosts on our VM.
Our entire production environment is on Azure, and we have logged multiple tickets with Microsoft, but they haven't been able to help. Then I found this thread. Please help with this. I can do a screen share if needed.
Could it be related to Cosmos DB throttling requests when the load is high? The issue hits us randomly and corrects itself after a few hours.
@zhang-hua @vkumarsharma which OS are you using when you run into this issue?
@srinathnarayanan, this issue occurs in a Linux container.
For us it is an Ubuntu VM with Tomcat 8, no container.
hey, @srinathnarayanan, do you need any more info for this issue or have any update on this? Thanks!
Hi @zhang-hua we are actively working on this and will add the retries to our next release. Thanks!
@srinathnarayanan , very appreciated! :-)
@zhang-hua, just to confirm: you get this error on the account after ARM has confirmed that the Cosmos account is created, right? Could you please share the account and the time when this happened? Ideally this failure should not happen after account creation.
@kirankumarkolli, correct. Once the Cosmos account and the Kubernetes cluster have been created by the ARM template, which takes a few minutes, a Kubernetes template is deployed on the cluster and spins up a service to create a new database. The NullPointerException happened for a couple of seconds just after account creation and was caused by an UnknownHostException, which indicates that the underlying network routing infrastructure is not fully ready to serve DB connections at that time. Retrying the creation of a new DocumentClient multiple times doesn't work, because the internal state is now incorrect and cannot recover even once the network is available. For example, when this error happens, restarting the service process in the container does work. So I'm guessing the DocumentClient has entered a zombie state that is not recoverable except by restarting. Adding the retry logic above to catch the UnknownHostException will keep the DocumentClient from blowing up.
@srinathnarayanan @kirankumarkolli Hey guys any update on this? When is your next release?