Azure / azure-service-bus-java

☁️ Java client library for Azure Service Bus
https://azure.microsoft.com/services/service-bus
MIT License

TimeoutException: UpdateState request timed out in version 1.2.8 #312

Open peterbomers opened 5 years ago

peterbomers commented 5 years ago

Hey,

I've recently updated our source to use version 1.2.8, and in our production environment we now get strange issues. After a random number of days the subscription client starts producing TimeoutExceptions with the message "UpdateState request timed out. Delivery:?????????".

The application processes the message successfully, but the final call to the service bus to ack the message fails. This results in endless reprocessing of messages.

I've been reading an existing issue #272 and implemented our application in the recommended way.

Registration

    private void receiveSubscriptionMessages() throws ServiceBusException, InterruptedException {
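        // MessageHandlerOptions( maxConcurrentCalls = 10, autoComplete = false, maxAutoRenewDuration = 5 minutes ).
        // autoComplete is false, so the handler is responsible for settling (completing or dead-lettering) each message.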
        final MessageHandlerOptions options = new MessageHandlerOptions( 10, false, Duration.ofMinutes( 5 ) );
        final ExecutorService executor = Executors.newFixedThreadPool( options.getMaxConcurrentCalls() );
        subscriptionClient.registerMessageHandler( new MessageHandler(), options, executor );
        log.info( "Registered to subscription" );
    }

Handler

    private class MessageHandler implements IMessageHandler {
        @Override
        public CompletableFuture<Void> onMessageAsync( final IMessage message ) {
            log.info( "New message {}", message.getMessageId() );
            final CompletableFuture<Void> future = new CompletableFuture<>();
            process( future, message );
            return future;
        }

        @Override
        public void notifyException( final Throwable exception, final ExceptionPhase phase ) {
            log.error( phase + " encountered exception:" + exception.getMessage(), exception );
        }
    }

Processor

    private void process( final CompletableFuture<Void> future, final IMessage message ) {
        try {
            try {
                 // logic
                future.complete( null );
            } catch ( final RetryableException e ) {
                future.completeExceptionally( e );
            } catch ( final ValidationException | NonRetryableException e ) {
                // can't handle message, send to deadletter queue
                log.error( "Poison message {}, sending to deadletter queue", message.getMessageId() );
                subscriptionClient.deadLetter( message.getLockToken() );
                future.complete( null );
            }
        } catch ( final InterruptedException | ServiceBusException e ) {
            // catch exceptions in service bus communication
            future.completeExceptionally( e );
        } catch ( final Exception e ) {
            // unhandled exception, should not occur
            future.completeExceptionally( e );
        }
    }

Any idea how to resolve this issue?

yvgopal commented 5 years ago

Need some more info.

  1. Do you set prefetch count on the subscription client?
  2. What do you mean by 'after random number of days'? Approximately one week or one month?
  3. What happens after you get this exception? Does your subscription client continue to receive and complete messages?
  4. Occasional timeouts are expected, but they should not be continuous and should never be more than 1 in a thousand. Clients receive timeouts when the service is being upgraded, or in case of a network glitch or another random issue that causes a node reboot on the service.
  5. If the timeouts you get are continuous, or there are a lot of them, please capture traces. The Java SDK supports SLF4J tracing, so you can hook up any logging framework of your choice (see the sketch after this list). Traces will help me figure out what is going on.
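
A minimal sketch of the tracing setup mentioned in point 5, assuming logback-classic is the SLF4J binding on the classpath (the class and method names here are illustrative, not part of the SDK):

    // Sketch only: raise the log level for the Service Bus client programmatically.
    // The SDK classes all log under the com.microsoft.azure.servicebus package prefix,
    // so this also covers com.microsoft.azure.servicebus.primitives.
    import ch.qos.logback.classic.Level;
    import ch.qos.logback.classic.Logger;
    import org.slf4j.LoggerFactory;

    public final class ServiceBusTracing {
        public static void enable() {
            final Logger sbLogger = (Logger) LoggerFactory.getLogger( "com.microsoft.azure.servicebus" );
            sbLogger.setLevel( Level.TRACE );
        }
    }

An equivalent logger entry in logback.xml (or in the configuration of whichever SLF4J binding is in use) works just as well.
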
peterbomers commented 5 years ago

> Need some more info.
>
> 1. Do you set prefetch count on the subscription client?

No, the prefetch count is untouched (zero by default).

> 2. What do you mean by 'after random number of days'? Approximately one week or one month?

Sometimes the issue occurs after 1 or 2 days, but at the moment it has been running stable for 8 days. The highest uptime is around 21 days.

> 3. What happens after you get this exception? Does your subscription client continue to receive and complete messages?

The message is received by the subscription client and the business logic runs as expected (we log each step). At the end, the complete method on the subscription client is called, and that call triggers the exception.

The max delivery count is set to 10. Each message is handled 10 times (the logging shows this as well) and then moves to the dead letter queue (I guess that's initiated by the service bus). So the client continues receiving messages but can't complete them.
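
For illustration, the redelivery is also visible on the message itself; a minimal sketch based on the handler above (the extra log line is an assumption, not something in the original code):

    @Override
    public CompletableFuture<Void> onMessageAsync( final IMessage message ) {
        // getDeliveryCount() grows with each redelivery, so repeated completion
        // timeouts show up here before the subscription's max delivery count (10)
        // moves the message to the dead letter queue.
        log.info( "Message {} delivery count {}", message.getMessageId(), message.getDeliveryCount() );
        final CompletableFuture<Void> future = new CompletableFuture<>();
        process( future, message );
        return future;
    }
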

> 4. Occasional timeouts are expected, but they should not be continuous and should never be more than 1 in a thousand. Clients receive timeouts when the service is being upgraded, or in case of a network glitch or another random issue that causes a node reboot on the service.

Ok, good to know.

> 5. If the timeouts you get are continuous, or there are a lot of them, please capture traces. The Java SDK supports SLF4J tracing, so you can hook up any logging framework of your choice. Traces will help me figure out what is going on.

Currently we are logging the exceptions to Application Insights. It shows 3 exceptions for each failed message:

| Message | Exception type | Failed method | Logger Message |
| --- | --- | --- | --- |
| Request timed out. | com.microsoft.azure.servicebus.primitives.TimeoutException | com.microsoft.azure.servicebus.primitives.CoreMessageReceiver$1.run | UpdateState request timed out. Delivery: ????A???? |
| Request timed out. | com.microsoft.azure.servicebus.primitives.TimeoutException | com.microsoft.azure.servicebus.primitives.CoreMessageReceiver$1.run | USERCALLBACK encountered exception:Request timed out. |
| Request timed out. | com.microsoft.azure.servicebus.primitives.TimeoutException | com.microsoft.azure.servicebus.primitives.CoreMessageReceiver$1.run | onMessage with message containing sequence number '56294995342236104' threw exception |

All with the same call stack:

    com.microsoft.azure.servicebus.primitives.TimeoutException:
       at com.microsoft.azure.servicebus.primitives.CoreMessageReceiver$1.run (CoreMessageReceiver.java:164)
       at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511)
       at java.util.concurrent.FutureTask.runAndReset (FutureTask.java:308)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301 (ScheduledThreadPoolExecutor.java:180)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:294)
       at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)
       at java.lang.Thread.run (Thread.java:748)
yvgopal commented 5 years ago

Ok. Your client seems to be running fine. Can you give me the namespace, entity name, and the sequence number of the message for which completions timed out 10 consecutive times? Also at least one timestamp (UTC) of the timeout exception? I will take a look at the logs on the service and see if something is wrong.

peterbomers commented 5 years ago


namespace: livestock-cloud-service-bus-prod
entity: animal-end-of-day
sequence number: 21110623253404171
exception timestamp: 09/12/2018, 07:17:20

yvgopal commented 5 years ago

I looked at the traces on the service. One of the front-end nodes was experiencing high CPU at the time, and some requests from your client to that node timed out. The message with the sequence number you provided couldn't be completed multiple times, but its lock was renewed a few times. So some operations timed out and some succeeded; unfortunately, all of the timed-out operations were 'complete' calls. After the message was dead-lettered, it was received on a connection established to a different node and completed immediately. So I attribute these failures to the high CPU. You should have noticed it on only a specific connection; by connection I mean a SubscriptionClient. Not all subscription clients should have experienced the same problem. We have since rolled out a new build to production on the service, and these high CPU issues are fixed. If you experience this issue again, please let me know or open a support ticket with Azure Service Bus. I can confirm that it is not a client-side issue.
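
Since occasional timeouts are expected even when nothing is wrong on either side, a client-side retry around the completion call can absorb them. A minimal sketch, reusing the fields from the original post (the retry count and the use of completeAsync here are illustrative assumptions, not a recommendation from this thread):

    // Sketch only: retry a timed-out complete a couple of times before giving up.
    // Assumes the lock is still valid because the pump keeps renewing it for the
    // configured maxAutoRenewDuration.
    private CompletableFuture<Void> completeWithRetry( final IMessage message, final int attemptsLeft ) {
        return subscriptionClient.completeAsync( message.getLockToken() )
            .handle( ( ignored, error ) -> {
                if ( error == null ) {
                    return CompletableFuture.<Void>completedFuture( null );
                }
                final Throwable cause = error instanceof java.util.concurrent.CompletionException ? error.getCause() : error;
                if ( attemptsLeft > 1 && cause instanceof com.microsoft.azure.servicebus.primitives.TimeoutException ) {
                    log.warn( "Complete timed out for {}, retrying", message.getMessageId() );
                    return completeWithRetry( message, attemptsLeft - 1 );
                }
                final CompletableFuture<Void> failed = new CompletableFuture<>();
                failed.completeExceptionally( cause );
                return failed;
            } )
            .thenCompose( f -> f );
    }
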

peterbomers commented 5 years ago


Thank you for the reply & information! I checked the last couple of weeks (I was out of office for a while) and saw that the issue occurred once again, yesterday at 04:53:17, with sequence number 3096224743940795.

peterbomers commented 5 years ago


FYI, this morning the issue reappeared (after working fine for several months).