Azure / azure-service-bus-dotnet

☁️ .NET Standard client library for Azure Service Bus
https://azure.microsoft.com/services/service-bus
Other
235 stars 120 forks source link

20 minute MessageHandler throws MessageLockLostException #645

Closed brettsam closed 5 years ago

brettsam commented 5 years ago

I’m on the WebJobs/Functions team and we’ve got a customer that’s getting a MessageLockLostException when running a 20-minute function. I was able to repro with a console app (code below). Some things to call out:

Is there anything obviously wrong with this configuration below that would cause this?

static void Main(string[] args)
{
    var receiver = new MessageReceiver("{snip}", "myqueue");
    receiver.PrefetchCount = 1;

    receiver.RegisterMessageHandler(DelaySome, new MessageHandlerOptions(LogException)
    {
        AutoComplete = true,
        MaxAutoRenewDuration = TimeSpan.FromMinutes(55)
    });

    Console.ReadLine();
}

private static async Task DelaySome(Message message, CancellationToken cancellationToken)
{
    Console.WriteLine($"[{DateTime.UtcNow}]: ---START MessageHandler---");
    await Task.Delay(20 * 60 * 1000);
    Console.WriteLine($"[{DateTime.UtcNow}]: ---END MessageHandler---");
}

public static Task LogException(ExceptionReceivedEventArgs args)
{
    Console.WriteLine(args.Exception.ToString());
    return Task.CompletedTask;
}

I wired up an EventListener to “Microsoft-Azure-ServiceBus” as well to capture those logs. They’re here.

[2/8/2019 11:13:41 PM] Creating MessageReceiver (Namespace 'brettsam-func.servicebus.windows.net'; Entity 'myqueue'; ReceiveMode 'PeekLock').
[2/8/2019 11:13:41 PM] MessageReceiver (Namespace 'brettsam-func.servicebus.windows.net'; Entity 'myqueue'; ClientId 'MessageReceiver1myqueue' created).
[2/8/2019 11:13:41 PM] MessageReceiver1myqueue: Register OnMessageHandler start: OnMessage Options: AutoComplete: True, AutoRenewLock: True, MaxConcurrentCalls: 1, AutoRenewTimeout: 3300
[2/8/2019 11:13:41 PM] MessageReceiver1myqueue: Register OnMessageHandler done.
[2/8/2019 11:13:41 PM] MessageReceiver1myqueue: ReceiveAsync start. MessageCount = 1
[2/8/2019 11:13:41 PM] Creating/Recreating New Link. ClientId: MessageReceiver1myqueue, IsSessionReceiver: False, SessionId: , IsRequestResponseLink: False, LinkError: .
[2/8/2019 11:13:41 PM] MessageReceiver1myqueue: AmqpReceiveLinkCreate started. IsRequestResponseLink: False,  EntityType: False, EntityPath:
[2/8/2019 11:13:41 PM] AmqpGetOrCreateConnection started.
[2/8/2019 11:13:41 PM] AmqpConnectionCreated: HostName: brettsam-func.servicebus.windows.net, ConnectionInfo: out-connection0, ConnectionState: Opened
[2/8/2019 11:13:41 PM] AmqpGetOrCreateConnection done. EntityPath: myqueue, ConnectionInfo: out-connection0, ConnectionState: Opened
[2/8/2019 11:13:41 PM] AmqpSendAuthenticanToken started. Address: sb://brettsam-func.servicebus.windows.net/myqueue, Audience: sb://brettsam-func.servicebus.windows.net/myqueue, Resource: sb://brettsam-func.servicebus.windows.net/myqueue, Claims: System.String[]
[2/8/2019 11:13:42 PM] AmqpSendAuthenticanToken done.
[2/8/2019 11:13:42 PM] MessageReceiver1myqueue: AmqpReceiveLinkCreate done.
[2/8/2019 11:14:41 PM] MessageReceiver1myqueue: ReceiveAsync done. Received '0' messages
[2/8/2019 11:14:41 PM] MessageReceiver1myqueue: MessageReceiverPump PumpTask done: Available Semaphore Count: 1
[2/8/2019 11:14:41 PM] MessageReceiver1myqueue: ReceiveAsync start. MessageCount = 1
[2/8/2019 11:15:28 PM] MessageReceiver1myqueue: ReceiveAsync done. Received '1' messages
[2/8/2019 11:15:28 PM] MessageReceiver1myqueue: MessageReceiverPump PumpTask Started: Message: SequenceNumber: 3659174697238877, Available Semaphore Count: 0
[2/8/2019 11:15:28 PM] MessageReceiver1myqueue: MessageReceiverPump DispatchTask start: Message: SequenceNumber: 3659174697238877
[2/8/2019 11:15:28 PM] MessageReceiver1myqueue: MessageReceiverPump UserCallback start: Message: SequenceNumber: 3659174697238877
[2/8/2019 11:15:28 PM]: ---START MessageHandler---
[2/8/2019 11:15:28 PM] MessageReceiver1myqueue: MessageReceiverPump RenewMessage start: Message: SequenceNumber: 3659174697238877, RenewAfterTimeInSeconds: 289
[2/8/2019 11:20:18 PM] MessageReceiver1myqueue: RenewLockAsync start. MessageCount = 1, LockToken = f19f4f93-4e3b-4831-a4d7-84ee25f9000d
[2/8/2019 11:20:18 PM] Creating/Recreating New Link. ClientId: MessageReceiver1myqueue, IsSessionReceiver: False, SessionId: , IsRequestResponseLink: True, LinkError: .
[2/8/2019 11:20:18 PM] MessageReceiver1myqueue: AmqpReceiveLinkCreate started. IsRequestResponseLink: True,  EntityType: True, EntityPath:
[2/8/2019 11:20:18 PM] AmqpGetOrCreateConnection started.
[2/8/2019 11:20:18 PM] AmqpGetOrCreateConnection done. EntityPath: myqueue/$management, ConnectionInfo: out-connection0, ConnectionState: Opened
[2/8/2019 11:20:18 PM] AmqpSendAuthenticanToken started. Address: sb://brettsam-func.servicebus.windows.net/myqueue/$management, Audience: sb://brettsam-func.servicebus.windows.net/myqueue/$management, Resource: sb://brettsam-func.servicebus.windows.net/myqueue/$management, Claims: System.String[]
[2/8/2019 11:20:18 PM] AmqpSendAuthenticanToken done.
[2/8/2019 11:20:18 PM] MessageReceiver1myqueue: AmqpReceiveLinkCreate done.
[2/8/2019 11:20:18 PM] MessageReceiver1myqueue: RenewLockAsync done.
[2/8/2019 11:20:18 PM] MessageReceiver1myqueue: MessageReceiverPump RenewMessage done: Message: SequenceNumber: 3659174697238877
[2/8/2019 11:20:18 PM] MessageReceiver1myqueue: MessageReceiverPump RenewMessage start: Message: SequenceNumber: 3659174697238877, RenewAfterTimeInSeconds: 289
[2/8/2019 11:25:08 PM] MessageReceiver1myqueue: RenewLockAsync start. MessageCount = 1, LockToken = f19f4f93-4e3b-4831-a4d7-84ee25f9000d
[2/8/2019 11:25:08 PM] MessageReceiver1myqueue: RenewLockAsync done.
[2/8/2019 11:25:08 PM] MessageReceiver1myqueue: MessageReceiverPump RenewMessage done: Message: SequenceNumber: 3659174697238877
[2/8/2019 11:25:08 PM] MessageReceiver1myqueue: MessageReceiverPump RenewMessage start: Message: SequenceNumber: 3659174697238877, RenewAfterTimeInSeconds: 289
[2/8/2019 11:29:58 PM] MessageReceiver1myqueue: RenewLockAsync start. MessageCount = 1, LockToken = f19f4f93-4e3b-4831-a4d7-84ee25f9000d
[2/8/2019 11:29:58 PM] MessageReceiver1myqueue: RenewLockAsync done.
[2/8/2019 11:29:58 PM] MessageReceiver1myqueue: MessageReceiverPump RenewMessage done: Message: SequenceNumber: 3659174697238877
[2/8/2019 11:29:58 PM] MessageReceiver1myqueue: MessageReceiverPump RenewMessage start: Message: SequenceNumber: 3659174697238877, RenewAfterTimeInSeconds: 289
[2/8/2019 11:34:48 PM] MessageReceiver1myqueue: RenewLockAsync start. MessageCount = 1, LockToken = f19f4f93-4e3b-4831-a4d7-84ee25f9000d
[2/8/2019 11:34:48 PM] MessageReceiver1myqueue: RenewLockAsync done.
[2/8/2019 11:34:48 PM] MessageReceiver1myqueue: MessageReceiverPump RenewMessage done: Message: SequenceNumber: 3659174697238877
[2/8/2019 11:34:48 PM] MessageReceiver1myqueue: MessageReceiverPump RenewMessage start: Message: SequenceNumber: 3659174697238877, RenewAfterTimeInSeconds: 289
[2/8/2019 11:35:28 PM]: ---END MessageHandler---
[2/8/2019 11:35:28 PM] MessageReceiver1myqueue: MessageReceiverPump UserCallback done: Message: SequenceNumber: 3659174697238877
[2/8/2019 11:35:28 PM] MessageReceiver1myqueue: CompleteAsync start. MessageCount = 1, LockTokens = <LockToken>f19f4f93-4e3b-4831-a4d7-84ee25f9000d</LockToken>
[2/8/2019 11:35:28 PM] Creating/Recreating New Link. ClientId: MessageReceiver1myqueue, IsSessionReceiver: False, SessionId: , IsRequestResponseLink: False, LinkError: .
[2/8/2019 11:35:28 PM] MessageReceiver1myqueue: AmqpReceiveLinkCreate started. IsRequestResponseLink: False,  EntityType: False, EntityPath:
[2/8/2019 11:35:28 PM] AmqpGetOrCreateConnection started.
[2/8/2019 11:35:28 PM] AmqpGetOrCreateConnection done. EntityPath: myqueue, ConnectionInfo: out-connection0, ConnectionState: Opened
[2/8/2019 11:35:28 PM] AmqpSendAuthenticanToken started. Address: sb://brettsam-func.servicebus.windows.net/myqueue, Audience: sb://brettsam-func.servicebus.windows.net/myqueue, Resource: sb://brettsam-func.servicebus.windows.net/myqueue, Claims: System.String[]
[2/8/2019 11:35:28 PM] AmqpSendAuthenticanToken done.
[2/8/2019 11:35:28 PM] MessageReceiver1myqueue: AmqpReceiveLinkCreate done.
[2/8/2019 11:35:29 PM] MessageReceiver1myqueue: CompleteAsync Exception: Microsoft.Azure.ServiceBus.MessageLockLostException: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue, or was received by a different receiver instance.
   at Microsoft.Azure.ServiceBus.Core.MessageReceiver.DisposeMessagesAsync(IEnumerable`1 lockTokens, Outcome outcome) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\Core\MessageReceiver.cs:line 1429
   at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\RetryPolicy.cs:line 85
   at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\RetryPolicy.cs:line 107
   at Microsoft.Azure.ServiceBus.Core.MessageReceiver.CompleteAsync(IEnumerable`1 lockTokens) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\Core\MessageReceiver.cs:line 527.
Microsoft.Azure.ServiceBus.MessageLockLostException: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue, or was received by a different receiver instance.
  at Microsoft.Azure.ServiceBus.Core.MessageReceiver.DisposeMessagesAsync(IEnumerable`1 lockTokens, Outcome outcome) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\Core\MessageReceiver.cs:line 1429
   at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\RetryPolicy.cs:line 85
   at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\RetryPolicy.cs:line 107
   at Microsoft.Azure.ServiceBus.Core.MessageReceiver.CompleteAsync(IEnumerable`1 lockTokens) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\Core\MessageReceiver.cs:line 527
   at Microsoft.Azure.ServiceBus.MessageReceivePump.CompleteMessageIfNeededAsync(Message message) in C:\source\azure-service-bus-dotnet\src\Microsoft.Azure.ServiceBus\MessageReceivePump.cs:line 216
[2/8/2019 11:35:29 PM] MessageReceiver1myqueue: ReceiveAsync start. MessageCount = 1
[2/8/2019 11:35:29 PM] MessageReceiver1myqueue: MessageReceiverPump DispatchTask done: Message: SequenceNumber: 3659174697238877, Current Semaphore Count: 0
[2/8/2019 11:36:29 PM] MessageReceiver1myqueue: ReceiveAsync done. Received '0' messages
[2/8/2019 11:36:29 PM] MessageReceiver1myqueue: MessageReceiverPump PumpTask done: Available Semaphore Count: 1
[2/8/2019 11:36:29 PM] MessageReceiver1myqueue: ReceiveAsync start. MessageCount = 1
[2/8/2019 11:37:29 PM] MessageReceiver1myqueue: ReceiveAsync done. Received '0' messages
[2/8/2019 11:37:29 PM] MessageReceiver1myqueue: MessageReceiverPump PumpTask done: Available Semaphore Count: 1
[2/8/2019 11:37:29 PM] MessageReceiver1myqueue: ReceiveAsync start. MessageCount = 1
SeanFeldman commented 5 years ago

@binzywu is there a max lock renewal duration on the server by any chance?

arun-binary-write commented 5 years ago

I have the similar issue,

I have lock duration set on my topic for 2 seconds, and set MaxAutoRenewDuration=4 seconds.

So I expect the the message lock to be renewed when the message handler takes more than 2 seconds, and I can this happening as i expected.

However every-time when it renews the lock it is setting the new lock token!.

Now, the message handler eventually completes the process and trying to call the CompleteAsync with the lock token it received initially , when this happen I get the messagelocklost exception.

Then I further investigated from service bus explorer, it shows different lock token than I initially received.

Is this how it is expected to behave ? in that case how do I know the current lock token to mark the message as completed. And what is the fix for it ?

SeanFeldman commented 5 years ago

@techiearun LockDuration of 2 seconds and MaxAutoRenewDuration of 4 seconds sounds a bit suspicious. How long is the message processing? Also, note that MaxAutoRenewDuration means that the maximum total time a message could be locked would be 4 seconds. And that's not guaranteed since auto-renewal is not a guaranteed operation.

arun-binary-write commented 5 years ago

@SeanFeldman The message usually gets processed under 2 seconds, 4 out of 50 messages is taking more than 2 seconds that is around 3 seconds.. so I had set the MaxAutoRenewDuration to 4 seconds with the hope that message lock will be renewed automatically for additional 2 seconds if the message handler didn't complete the process under 2 seconds.

Is this the wrong usage ?

SeanFeldman commented 5 years ago

MaxLockDuration can be set up to 5 minutes. MaxAutoRenewDuration is designed to allow an option to extend processing beyond 5 minutes (though I would be careful about it). I would set it up differently. I would set the MaxLockDuration to up-to 30 seconds and not utilize MaxAutoRenewDuration altogether. That way, you're guaranteed not to lose the lock while processing is taking place. And if it's truly under 30 seconds processing, you won't see LockLostExceptions. And if you do, then investigate what's going on.

arun-binary-write commented 5 years ago

@brettsam Thanks for that. I do have another processor that is running with following configuration

Message Lock Duration : 5 Minutes MaxAutoRenewDuration : 301 seconds. ( Based the documentation that states "This value should be greater than the longest message lock duration; for example, the LockDuration Property" )

Expected processing time of a message 3-4 minutes on normal circumstances, can go over 5 minutes at times. So I have set above configurations. Even in this case i get MessageLockLostException when some of message processing time exceeds 5 minutes. I investigated and found that message lock renewed with differnt lock token that causes MessageLockLostException. For me, this behaviour seems to be a bug.

SeanFeldman commented 5 years ago

MaxAutoRenewDuration of 301 seconds? That's 5 minutes and 1 second of total time. It's that what you want? MaxAutoRenewDuration should be the maximum duration your message should be locked. So if processing can take more than 5 minutes, find the longest duration and use that plus a few seconds.

I investigated and found that message lock renewed with differnt lock token that causes MessageLockLostException. For me, this behaviour seems to be a bug.

Could you link your repro to back this statement?

Matt-Westphal commented 5 years ago

Back to the original post, what is the best way to deal with AutoComplete = true, MaxRenewDuration = 55 minutes, the handler taking longer than 20 minutes and getting a MessageLockLostException? I'm experiencing the same thing in an application that we're working on.

SeanFeldman commented 5 years ago

@Matt-Westphal does it happen for every message that takes 20+ minutes to process?

Personally, I'm in a mindset that processing of 20 minutes should be implemented in a manner where a message is locked for that time. But that's my personal view.

vipinmathew2013 commented 5 years ago

Hello All, does anyone know if the prevoius versions than 3.40 has the same issue? Maybe older versions does not have the issue.

Matt-Westphal commented 5 years ago

@SeanFeldman yes it was happening for every message. The process is creating multiple VMs in Azure and takes between 20-30 minutes to complete. If it fails we want the message to remain in the queue so that the process can try again. Usually 2nd attempt works.

The code that we had is the same as issue 684. I then modified the code to use AutoComplete = true for the MessageHandler and the MaxAutoRenewDuration to 1 hour. The code was still throwing the MessageLockLostException, but then I found a couple of references "Lock duration" within the properties of the queue. Once I change that from 30 seconds to 5 minutes, the code now completes the message successfully when the process takes 20-30 minutes. It would be nice to set the lock duration to 20 minutes, but the UI isn't letting me enter anything after 5 minutes.

vinaysurya commented 5 years ago

Hello All, this is an issue on the Service side and only happens for entities with 'EnablePartitioning' set to true. We are in the process of fixing this and should be deployed to production clusters in our next update. In the meantime, if possible for your scenario, a potential workaround would be to create an entity with 'EnablePartitioning' set to false and you should not see this issue with that. I will update this thread as I have more information about when the fix may be deployed.

SeanFeldman commented 5 years ago

I will update this thread as I have more information about when the fix may be deployed.

@vinaysurya thank you so much for clarifying on the issue. As this is not an issue with the client, it would be great if the broker issue tracker, https://github.com/Azure/azure-service-bus/issues, would contain an issue with symptoms and workarounds. If I may suggest a format, something among the lines of this issue: https://github.com/Particular/NServiceBus.Transport.AzureServiceBus/issues/51. Then this issue could be closed pointing to the broker issue tracker item.

In the meantime, if possible for your scenario, a potential workaround would be to create an entity with 'EnablePartitioning' set to false and you should not see this issue with that.

Please note that deleting an entity and re-creating is not a viable workaround for those that cannot shut down production system for this change. Also, it negates the advise to keep entities partitioned to improve the HA story on the standard tier. A non-partitioned entity will eventually suffer from short outages when containers are moved or replaced, affecting production system much more than systems with partitioned entities.

SeanFeldman commented 5 years ago

Broker issue: https://github.com/Azure/azure-service-bus/issues/276

vinaysurya commented 5 years ago

Good suggestion Sean, created an issue https://github.com/Azure/azure-service-bus/issues/276 for Service.

Agree that it may not be possible to re-create entities for every scenario, we'll try to release fix as soon as we can. But wanted to mention it just in case there were people who could use it.