Particular / NServiceBus.Transport.AzureServiceBus

Azure Service Bus transport

ASB transport using 'SendsAtomicWithReceive' mode cannot forward messages to the error queue when handler execution exceeds message lock duration #1043 #1053

Open soujay opened 2 months ago

soujay commented 2 months ago

Describe the bug

Description

When using the SendsAtomicWithReceive transaction mode, if the handler execution time exceeds the message lock duration (including any lock renewal), the recoverability process cannot complete. During recoverability, a copy of the message is sent to the error queue and the original message must be removed from the input queue. However, once the message lock has expired, the original message cannot be removed because the broker has already made it available to other receivers. As a result, processing gets stuck in an infinite loop, because the handler can never finish processing the message before the lock expires.

Expected behavior

If the message handler always exceeds the message lock duration, the recoverability process should move the message to the error queue.

Actual behavior

Message processing goes into an infinite loop: the original message is never removed from the input queue, while the error queue begins to fill up with the error message.

Steps to reproduce

  1. Create a handler that takes more than 5 minutes to complete (the default ASB message lock duration is 5 minutes).
  2. Send a message to that handler.
  3. The endpoint will attempt to process the message, but after 5 minutes the message lock expires and the message becomes visible again.
  4. After the first processing attempt completes, the transport calls CompleteMessageAsync on the message, but a ServiceBusException is raised with the reason MessageLockLost, leaving the message in the input queue.
  5. In the meantime, another thread picks up the message, which is visible again now that the lock duration has elapsed.
  6. This continues forever and the message will never be removed from the queue.
  7. The log will say that delayed retries will be scheduled, but the delayed retry messages cannot be sent once the lock has expired, so the configured delayed retry policy is never executed. The message stays in the input queue forever: no delayed retries occur and no message is sent to the error queue.
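
To make the loop in the steps above easier to reason about, here is a minimal Python sketch of a peek-lock queue with a simulated clock. It is a toy model, not the transport's code: ToyQueue, MessageLockLost, and simulate are hypothetical names, and the "atomic" behaviour is modelled simply by recording outgoing messages only when completing the incoming message succeeds.

```python
from dataclasses import dataclass, field


class MessageLockLost(Exception):
    """Stand-in for a ServiceBusException with reason MessageLockLost."""


@dataclass
class ToyQueue:
    lock_duration: float                           # seconds
    now: float = 0.0                               # simulated clock, no real sleeping
    messages: dict = field(default_factory=dict)   # message id -> locked-until time

    def send(self, message_id):
        self.messages[message_id] = 0.0            # visible immediately

    def receive(self):
        """Peek-lock the first visible message; return (id, lock expiry) or None."""
        for message_id, locked_until in self.messages.items():
            if locked_until <= self.now:
                expiry = self.now + self.lock_duration
                self.messages[message_id] = expiry
                return message_id, expiry
        return None

    def complete(self, message_id, lock_expiry):
        """Remove the message, but only while the receiver's lock is still valid."""
        if self.now >= lock_expiry:
            raise MessageLockLost(message_id)
        del self.messages[message_id]


def simulate(handler_seconds, lock_seconds, attempts):
    """Return (lock-lost count, messages left in input queue, outgoing messages sent)."""
    queue = ToyQueue(lock_duration=lock_seconds)
    queue.send("msg-1")
    outgoing = []
    lock_lost = 0
    for _ in range(attempts):
        received = queue.receive()
        if received is None:
            break                                  # queue drained
        message_id, lock_expiry = received
        queue.now += handler_seconds               # the handler runs
        try:
            queue.complete(message_id, lock_expiry)
            outgoing.append(message_id)            # sends commit only with the completion
        except MessageLockLost:
            lock_lost += 1                         # sends roll back; message is redelivered
    return lock_lost, len(queue.messages), len(outgoing)


print(simulate(handler_seconds=360, lock_seconds=300, attempts=5))
print(simulate(handler_seconds=60, lock_seconds=300, attempts=5))
```

With a 6-minute handler and a 5-minute lock, every attempt in this model loses the lock, the message stays in the input queue, and nothing is ever sent. With a 1-minute handler, the first attempt completes and the queue drains.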

Relevant log output

WARN  Skip handling the message with id '{message ID}' because the lock has expired at '{time}'. This is usually an indication that the endpoint prefetches more messages than it is able to handle within the configured peek lock duration.

Additional Information

In the SendsAtomicWithReceive transaction mode, any outgoing operations associated with processing the incoming message are rolled back if the incoming message is not successfully processed. Therefore, the LRU cache approach used with the ReceiveOnly transaction mode is not feasible here, as the handler never completes successfully. In the ReceiveOnly transaction mode, finding a message ID in the LRU cache indicates that the message has already been handled and its outgoing operations have already been executed, so the message can be removed from the queue without invoking the message handler again.
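
As an illustration of the ReceiveOnly idea described above, the following Python sketch shows an LRU cache of handled message IDs that lets a redelivered message be completed without running the handler again. This is an assumption-laden sketch, not the transport's actual implementation: CompletedMessageCache and on_receive are hypothetical names, and the handler and complete callbacks stand in for the real pipeline.

```python
from collections import OrderedDict


class CompletedMessageCache:
    """Bounded LRU set of message IDs whose handler has already run."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._ids = OrderedDict()

    def mark_handled(self, message_id):
        self._ids[message_id] = True
        self._ids.move_to_end(message_id)          # most recently used goes last
        if len(self._ids) > self.capacity:
            self._ids.popitem(last=False)          # evict the least recently used ID

    def was_handled(self, message_id):
        if message_id in self._ids:
            self._ids.move_to_end(message_id)
            return True
        return False


def on_receive(message_id, cache, handler, complete):
    """ReceiveOnly-style receive path: skip the handler for already-handled messages."""
    if cache.was_handled(message_id):
        complete(message_id)                       # just remove it from the queue
        return "completed-without-handler"
    handler(message_id)                            # outgoing operations happen here
    cache.mark_handled(message_id)                 # remember before trying to complete
    complete(message_id)                           # may fail if the lock was lost
    return "handled"
```

The sketch only works because, in ReceiveOnly mode, the handler's outgoing operations are not rolled back when completing the message fails. In SendsAtomicWithReceive mode they are rolled back, so a cache hit would not mean the work was actually done.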

Workarounds

Increase the lock renewal duration so that it is greater than the handler duration multiplied by the prefetch count.
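
The arithmetic behind this workaround can be sketched in Python as follows. The function name and the example figures are hypothetical, and the reasoning is an assumption: prefetched messages are presumably locked while they wait, so the last one may wait for all earlier handlers before its own handler runs.

```python
from datetime import timedelta


def minimum_lock_renewal(handler_duration, prefetch_count):
    """Lower bound for the lock renewal duration; the configured value must exceed it."""
    # A prefetch count of 0 is treated as handling one message at a time.
    return handler_duration * max(1, prefetch_count)


# Hypothetical figures: a 6-minute handler with a prefetch count of 10.
print(minimum_lock_renewal(timedelta(minutes=6), 10))
```

For these example figures the bound is one hour, so the lock renewal duration would need to be configured above that.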