Azure / azure-documentdb-java

Java Sync SDK for SQL API of Azure Cosmos DB
MIT License

NullPointerException when calling readDatabase, need retry logic #91

Open zhang-hua opened 6 years ago

zhang-hua commented 6 years ago
[warn] c.m.a.d.DocumentClient - Failed to retrieve database account information. java.net.UnknownHostException: xxxxxxxxx.documents.azure.com: Try again
...
Error while getting DocumentDb database
java.lang.NullPointerException: null
    at com.microsoft.azure.documentdb.internal.BaseDatabaseAccountConfigurationProvider.getMaxReplicaSetSize(BaseDatabaseAccountConfigurationProvider.java:33)
    at com.microsoft.azure.documentdb.internal.directconnectivity.ConsistencyReader.read(ConsistencyReader.java:44)
    at com.microsoft.azure.documentdb.internal.directconnectivity.ReplicatedResourceClient.invoke(ReplicatedResourceClient.java:59)
    at com.microsoft.azure.documentdb.internal.directconnectivity.ServerStoreModel$1.apply(ServerStoreModel.java:84)
    at com.microsoft.azure.documentdb.internal.RetryUtility.executeStoreClientRequest(RetryUtility.java:113)
    at com.microsoft.azure.documentdb.internal.directconnectivity.ServerStoreModel.processMessage(ServerStoreModel.java:89)
    at com.microsoft.azure.documentdb.DocumentClient$8.apply(DocumentClient.java:2980)
    at com.microsoft.azure.documentdb.internal.RetryUtility.executeDocumentClientRequest(RetryUtility.java:58)
    at com.microsoft.azure.documentdb.DocumentClient.doRead(DocumentClient.java:2986)
    at com.microsoft.azure.documentdb.DocumentClient.readDatabase(DocumentClient.java:490)
zhang-hua commented 6 years ago

This issue can be fixed by adding retry logic to the getDatabaseAccountFromEndpoint() method in com/microsoft/azure/documentdb/DocumentClient.java:

DatabaseAccount getDatabaseAccountFromEndpoint(URI endpoint) throws DocumentClientException {
        DocumentServiceRequest request = DocumentServiceRequest.create(OperationType.Read, ResourceType.DatabaseAccount, "", null);
        this.putMoreContentIntoDocumentServiceRequest(request, HttpConstants.HttpMethods.GET);

        DocumentServiceResponse response = null;
        int retryCount = 32;
        while (--retryCount > 0) {
            try {
                request.setEndpointOverride(endpoint);
                response = this.gatewayProxy.doRead(request);
                break;
            } catch (IllegalStateException e) {
                // Ignore all errors. Discover is an optimization.
                String message = "Failed to retrieve database account information. %s";
                Throwable cause = e.getCause();
                if (cause != null) {
                    message = String.format(message, cause.toString());
                } else {
                    message = String.format(message, e.toString());
                }

                logger.warn(message);

                // catch UnknownHostException and retry
                if (cause instanceof UnknownHostException) {
                    try {
                        Thread.sleep(5000);
                    } catch (InterruptedException e1) {
                        // Restore the interrupt flag so callers can observe it, then stop retrying.
                        Thread.currentThread().interrupt();
                        break;
                    }
                } else {
                    break;
                }
            }
        }

        if (response != null) {
            return response.getResource(DatabaseAccount.class);
        } else {
            return null;
        }
    }
moderakh commented 6 years ago

@srinathnarayanan can you please take a look at this one?

srinathnarayanan commented 6 years ago

Hi @zhang-hua, I couldn't repro this issue. Are you still running into it? Is this in direct mode or gateway mode?

zhang-hua commented 6 years ago

@srinathnarayanan, this only happens in the first few minutes after the DocumentDB account is deployed. I also tried switching between direct and gateway mode, but that doesn't help. Before the NullPointerException is thrown, there is an UnknownHostException, which indicates that the route to the DocumentDB service is unreachable for a while at the beginning. The DocumentDB client in the Java SDK has no underlying retry logic, so it fails very quickly with a NullPointerException caused by this UnknownHostException. It happens very often in our IoT PCS2 solution, because we implemented a microservice that initializes the DocumentDB database at boot, right after ARM has successfully deployed the account. Please feel free to ping me if you need more details.
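
The sequence described above (a transient UnknownHostException right after deployment, followed by a NullPointerException) suggests a caller-side stopgap: wait until the endpoint's host name resolves before constructing the client. The following is only a minimal sketch under that assumption, not SDK code. `DnsReadiness`, `Resolver`, and `awaitResolvable` are hypothetical names introduced for illustration, and only JDK classes are used.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

/** Hypothetical helper (not part of the SDK): waits until a host name resolves. */
final class DnsReadiness {

    /** Small resolver abstraction so the wait loop can be exercised without a network. */
    interface Resolver {
        void resolve(String host) throws UnknownHostException;
    }

    /** Default resolver backed by the JDK's own lookup. */
    static final Resolver SYSTEM = host -> InetAddress.getByName(host);

    /**
     * Tries to resolve the endpoint's host up to maxAttempts times, sleeping
     * delayMillis between failed attempts.
     *
     * @return true once the host resolves; false if every attempt failed or
     *         the thread was interrupted while waiting
     */
    static boolean awaitResolvable(URI endpoint, Resolver resolver,
                                   int maxAttempts, long delayMillis) {
        String host = endpoint.getHost();
        if (host == null) {
            throw new IllegalArgumentException("Endpoint has no host: " + endpoint);
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                resolver.resolve(host);
                return true;
            } catch (UnknownHostException e) {
                if (attempt == maxAttempts) {
                    break; // no point sleeping after the last attempt
                }
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt for the caller and give up.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

A service could call `DnsReadiness.awaitResolvable(URI.create(endpoint), DnsReadiness.SYSTEM, 30, 5000)` before creating its `DocumentClient`. The JVM caches failed lookups for a short time (the `networkaddress.cache.negative.ttl` security property), so delays much shorter than that TTL may simply re-read the cached failure. This does not fix the NullPointerException inside the SDK; it only reduces the chance of hitting it.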

vkumarsharma commented 6 years ago

This issue has been hurting us on our production server for a long time now, and it occurs very frequently, even after we put the address of the Cosmos DB endpoint in /etc/hosts on our VM.

Our entire production environment is on Azure, and we have logged multiple tickets with Microsoft, but they haven't been able to help. Then I found this thread. Please help with this. I can do a screen share if needed.

Could it be related to Cosmos DB throttling requests when the load is high? I ask because the issue appears randomly for us and corrects itself after a few hours.

srinathnarayanan commented 6 years ago

@zhang-hua @vkumarsharma which OS are you using when you run into this issue?

zhang-hua commented 6 years ago

@srinathnarayanan, this issue occurs in a Linux container.

vkumarsharma commented 6 years ago

For us it is an Ubuntu VM with Tomcat 8, no container.

zhang-hua commented 6 years ago

Hey @srinathnarayanan, do you need any more info on this issue, or do you have an update? Thanks!

srinathnarayanan commented 6 years ago

Hi @zhang-hua we are actively working on this and will add the retries to our next release. Thanks!

zhang-hua commented 6 years ago

@srinathnarayanan, much appreciated! :-)

kirankumarkolli commented 6 years ago

@zhang-hua just to confirm: you get this error on the account after ARM has confirmed that the Cosmos account is created, right? Could you please share the account name and the time when this happened? Ideally, this failure should not happen after account creation.

zhang-hua commented 6 years ago

@kirankumarkolli, correct. Once the ARM template has created the Cosmos account and the Kubernetes cluster, which takes a few minutes, a Kubernetes template is deployed on the cluster and spins up a service that creates a new database. The NullPointerException occurs for a couple of seconds just after account creation. It is caused by the UnknownHostException, which indicates that the underlying network routing infrastructure is not yet fully ready to serve database connections. Retrying by creating a new DocumentClient doesn't work, because the internal state is incorrect by then and does not recover even once the network is available. Restarting the service process in the container, however, does work. So I'm guessing the DocumentClient enters a zombie state that is only recoverable by restarting. Adding the retry logic above, which catches the UnknownHostException, keeps the DocumentClient from blowing up.
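
Since restarting the process is the one recovery that is reported to work, a service could fail fast at startup and let its supervisor (for example, a Kubernetes restart policy) start it again. The following is a minimal sketch under that assumption, not SDK code; `StartupGuard` and `runOnce` are hypothetical names, and the initialization step is passed in by the caller.

```java
import java.util.concurrent.Callable;

/** Hypothetical startup guard (not part of the SDK). */
final class StartupGuard {

    static final int EXIT_OK = 0;
    static final int EXIT_RESTART = 1;

    /**
     * Runs the initialization step once and maps the outcome to a process
     * exit code: 0 on success, 1 on any failure (including the
     * NullPointerException described in this thread).
     */
    static int runOnce(Callable<?> initStep) {
        try {
            initStep.call();
            return EXIT_OK;
        } catch (InterruptedException e) {
            // Preserve the interrupt and ask for a restart.
            Thread.currentThread().interrupt();
            return EXIT_RESTART;
        } catch (Exception e) {
            System.err.println("Initialization failed, requesting restart: " + e);
            return EXIT_RESTART;
        }
    }
}
```

In a `main` method this might be used as `System.exit(StartupGuard.runOnce(() -> { initDatabase(); return null; }));`, where `initDatabase()` stands for the caller's own (hypothetical) database setup code. A non-zero exit lets the container runtime restart the process instead of leaving it running with a client that cannot recover.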

ppathan commented 6 years ago

@srinathnarayanan @kirankumarkolli Hey guys any update on this? When is your next release?