Azure / azure-service-bus-java

☁️ Java client library for Azure Service Bus
https://azure.microsoft.com/services/service-bus
MIT License
60 stars 59 forks source link

Periods of prolonged inactivity and frequent MessageLockLostException in QueueClient #404

Closed highbrowansala closed 4 years ago

highbrowansala commented 4 years ago

Hello everyone,

Background

We have a data transfer solution with Azure Service Bus as the message broker. We are transferring data from x datasets through x queues - with x dedicated QueueClients as senders. Some senders publish messages at the rate of one message every two seconds, while others publish one every 15 minutes.

The application on the data source side (where senders are) is working just fine, giving us the desired throughput.

On the other side, we have an application with one QueueClient receiver per queue with the following configuration:

The MessageHandler registered with each of these receivers does the following:

public CompletableFuture<Void> onMessageAsync(final IMessage message) {

    // deserialize the message body
    final CustomObject customObject = (CustomObject)SerializationUtils.deserialize((byte[])message.getMessageBody().getBinaryData().get(0));

    // process processDB1() and processDB2() asynchronously
    final List<CompletableFuture<Boolean>> processFutures = new ArrayList<CompletableFuture<Boolean>>();

    processFutures.add(processDB1(customObject));  // processDB1() returns Boolean
    processFutures.add(processDB2(customObject)); // processDB2() returns Boolean

    // join both the completablefutures to get the result Booleans
    List<Boolean> results = CompletableFuture.allOf(processFutures.toArray(new CompletableFuture[processFutures.size()])).thenApply(future -> processFutures.stream()
        .map(CompletableFuture<Boolean>::join).collect(Collectors.toList())

    if (results.contains(false)) {
        // dead-letter the message if results contains false
        return getQueueClient().deadLetterAsync(message.getLockToken());
    } else {
        // complete the message otherwise
        getQueueClient().completeAsync(message.getLockToken());
    }
}

We tested with the following scenarios:

Scenario 1 - receive mode = RECEIVEANDDELETE, message publish rate: 30/ minute

Expected Behavior

The messages should be received continuosuly with a constant throughput (which need not necessarily be the throughput at source, where messages are published).

Actual behavior

We observe random, long periods of inactivity from the QueueClient - ranging from minutes to hours - there is no Outgoing Messages from the Service Bus namespace (observed on the Metrics charts) and there are no consumption logs for the same time periods!

Scenario 2 - receive mode = PEEKLOCK, message publish rate: 30/ minute

Expected Behavior

The messages should be received continuosuly with a constant throughput (which need not necessarily be the throughput at source, where messages are published).

Actual behavior

We keep seeing MessageLockLostException constantly after 20-30 minutes into the run of the application.

We tried doing the following -

  1. we reduced the prefetch count (from 20 * processing rate - as mentioned in the Best Practices guide) to a bare minimum (to even 0 in one test cycle), to reduce the no. of messages that are locked for the client
  2. increased the maxAutoRenewDuration to 5 minutes - our processDB1() and processDB2() do not take more than a second or two for almost 90% of the cases - so, I think the lock duration of 30 seconds and maxAutoRenewDuration are not issues here.
  3. removed the blocking CompletableFuture.get() and made the processing synchronous.

None of these tweaks helped us fix the issue. What we observed is that the COMPLETE or RENEWMESSAGELOCK are throwing the MessageLockLostException.

We need help with finding answers for the following:

  1. why is there a long period of inactivity of the QueueClient in scenario 1?
  2. how do we know the MessageLockLostExceptions are thrown, because the locks have indeed expired? we suspect the locks cannot expire too soon, as our processing happens in a second or two. disabling prefetch also did not solve this for us.

Versions and Service Bus details

Java - openjdk-11-jre Azure Service Bus namespace: sbrps-geoshield-qa (tier: Standard) Entity: 5d639f10-c4de-11ea-87d0-0242ac130003_avl_currentpositions testing time window: 2020-07-20 10:30 PM till 2020-07-21 8:30 AM (IST - UTC + 5:30) Java SDK version - 3.4.0

highbrowansala commented 4 years ago

The issue was not with the QueueClient object per se. It was with the processes that we were triggering from within the MessageHandler: processDB1(customObject) and processDB2(customObject). since these processes were not optimized, the message consumption dropped and the locks gor expired (in peek-lock mode), as the handler was spending more time (in relation to the rate at which messages were published to the queues) in completing these opertations.

After optimizing the processes, the consumption and completion (in peek-lock) mode were just fine.