Hi @jamespavett, thank you for opening an issue! I'll tag some folks who can help; we'll get back to you as soon as possible.
Thank you for the detailed bug report @jamespavett. I had a few follow-up questions for you: could you share how the queue is configured (max delivery count, message time to live, lock duration) and any other features that are turned on? The clients don't set the value for lock time; it's received from the service. Once we are able to repro on our side, we will have next steps.
Hi @jamespavett. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
@kashifkhan So I'll put a selection below of how the queue is configured; if you need anything else, just let me know.
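For reference, the queue properties being asked about (lock duration, max delivery count, message TTL) can be read with the administration client. A minimal sketch, assuming connection-string auth; the connection string and queue name are placeholders:

```python
# Minimal sketch (not from the thread): reading the queue settings asked about above.
from azure.servicebus.management import ServiceBusAdministrationClient

CONN_STR = "<service-bus-connection-string>"  # placeholder
QUEUE_NAME = "<queue-name>"                   # placeholder

with ServiceBusAdministrationClient.from_connection_string(CONN_STR) as admin_client:
    props = admin_client.get_queue(QUEUE_NAME)
    print("lock_duration:", props.lock_duration)
    print("max_delivery_count:", props.max_delivery_count)
    print("default_message_time_to_live:", props.default_message_time_to_live)
```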
In .NET, originally I was setting a prefetchCount of 20. After removing the value and going back to the default, the locked-until times seem to behave as I would expect. I did try experimenting with the prefetch value in the Python SDK, but it never seemed to make a difference. I can share the .NET code if you like, but it uses pretty much an identical implementation.
I will give the autolock renewer a go to see if that mitigates anything, but I was getting the same issue before when manually trying to renew the locks.
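A rough sketch of what that attempt looks like with the async API; this is not the reporter's actual code, and the connection string and queue name are placeholders:

```python
# Hypothetical sketch of wiring up the AutoLockRenewer on the async receiver.
import asyncio
from azure.servicebus.aio import ServiceBusClient, AutoLockRenewer

CONN_STR = "<service-bus-connection-string>"  # placeholder
QUEUE_NAME = "<queue-name>"                   # placeholder

async def main():
    renewer = AutoLockRenewer(max_lock_renewal_duration=300)  # keep locks alive for up to 5 minutes
    async with ServiceBusClient.from_connection_string(CONN_STR) as client:
        receiver = client.get_queue_receiver(QUEUE_NAME, auto_lock_renewer=renewer)
        async with receiver:
            msgs = await receiver.receive_messages(max_message_count=10, max_wait_time=5)
            for msg in msgs:
                print(msg.locked_until_utc)  # should keep moving forward while the renewer holds the lock
                await receiver.complete_message(msg)
    await renewer.close()

asyncio.run(main())
```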
@kashifkhan just tried with the autolock renewer, and it didn't change the behavior at all.
thanks for the update @jamespavett. We will try and repro on our end and go from there.
qq: the code to complete messages is commented out, I assume it's just for the sake of the repro?
@kashifkhan yeah, that was just commented out for the repro. The same behaviour can be observed on my end either way.
@jamespavett are you receiving large messages from your queue? If possible, can you send us a sample message?
@kashifkhan Unfortunately I can't send a sample, but they mostly range from around 300-650 KB. They are Symfony Envelopes for messages.
@kashifkhan I've done some more work on this today. Message size seems to be a big factor: when processing messages that are only a few KB, everything works as expected, but I get this problem when processing larger messages.
@jamespavett We figured that was the case looking at your logs, but a repro has been evading us. Are you able to see if the issue still happens when you send in max_message_count=1?
Hi @jamespavett. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
The issue still persists after setting max_message_count to 1. To be fair, I have already tried a lot of configurations around this value and prefetch_count.
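For clarity, the two knobs discussed here live in different places: prefetch_count is set on the receiver, while max_message_count is passed per receive call. A compact illustration with placeholder names:

```python
# Sketch only: prefetch_count is set when the receiver is created,
# max_message_count is passed to each receive_messages() call.
from azure.servicebus import ServiceBusClient

with ServiceBusClient.from_connection_string("<service-bus-connection-string>") as client:
    with client.get_queue_receiver("<queue-name>", prefetch_count=0) as receiver:  # 0 is the default
        msgs = receiver.receive_messages(max_message_count=1, max_wait_time=5)
        for msg in msgs:
            print(msg.locked_until_utc)
```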
LockedUntilUtc is set at the moment a message is locked in the queue. If 10 messages have the same LockedUntilUtc, it means they were all locked at the same instant. In this case, the receiver is prefetching messages when you first call receive, or even before. Messages are locked when they are prefetched. When those messages are handed over to your application determines how much of the lock duration is left: if a message is prefetched at instant x but your application gets it via a receive() call at x+10 seconds, you will see 10 seconds less of the lock remaining before LockedUntilUtc.
That's the reason you don't have this problem when you disable prefetch in the .NET SDK. That's also the reason we suggest SDKs default to a prefetch count of 0.
I don't know how, but the SDK is prefetching messages in this case.
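One way to make the erosion described above visible is to log how much of the lock window is left at the moment the application is actually handed each message. A small diagnostic sketch, assuming an already-created receiver:

```python
# Diagnostic sketch: with prefetch, the lock clock starts at prefetch time,
# so the remaining window shrinks the longer a message waits in the local buffer.
from datetime import datetime, timezone

msgs = receiver.receive_messages(max_message_count=10, max_wait_time=5)
for msg in msgs:
    remaining = msg.locked_until_utc - datetime.now(timezone.utc)
    print(f"remaining lock time: {remaining.total_seconds():.1f}s")
```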
Hi,
I am experiencing exactly the same problem. I am sending quite large messages (ca. 1 MB) onto a queue, and in another process, when listening with the Azure SDK for Python, I see exactly the same symptoms as explained above: the time the lock is held decreases quickly and eventually my receiver reports the problem as a MessageLockLostError.
await receiver.complete_message(msg)
File "/usr/local/lib/python3.10/site-packages/azure/servicebus/aio/_servicebus_receiver_async.py", line 852, in complete_message
await self._settle_message_with_retry(message, MESSAGE_COMPLETE)
File "/usr/local/lib/python3.10/site-packages/azure/servicebus/aio/_servicebus_receiver_async.py", line 494, in _settle_message_with_retry
raise MessageLockLostError(
azure.servicebus.exceptions.MessageLockLostError: The lock on the message lock has expired.
My setup:
Tried setting prefetch to 0, but this does not help; it is the default value anyway.
@yvgopal your reasoning makes sense. The question is why, in this case, the SDK is prefetching multiple messages when all the parameters specify not to do so.
In our use case we planned to exchange messages of up to 50 MB, but the pilot implementation fails at 1 MB, making Azure Service Bus unsuitable for the job.
@jamespavett were you able to somehow solve this problem?
Hi @jarekhr, @jamespavett, we have a PR out that we believe will help address this issue; it will be released in the next version of the Service Bus client library.
@l0lawrence, is the fix on the server side or in the client library? When do you expect the fix to be released to the West Europe region? Thanks!
@jarekhr the fix is in the client library, so it will be available to everyone as soon as it lands on PyPI. We will update this thread once that happens :)
@jarekhr @jamespavett The fix is now available on PyPI.
Thanks @kashifkhan, I tested the new version of the SDK and the problem is indeed fixed.
That's great @jarekhr, cc @l0lawrence
Describe the bug
When receiving messages off the Service Bus using the receiver's receive_messages, inconsistent behaviour seems to be occurring around the locked_until_utc field.
receive_messages is being called inside a while loop, retrieving messages in batches. For each iteration, I would expect the locked_until_utc values to be relatively similar, with the next batch having locked_until_utc values slightly further in the future, and so on, since locked_until_utc should be set when the messages are received in each iteration of the loop.
Instead, locked_until_utc barely seems to increase at all as the script runs, regardless of the number of messages processed or how long the script has been running. The locked_until_utc times almost seem to be locked to the time of the first message retrieval, or to the call to get_queue_receiver.
To Reproduce
Basic code example to show the issue; I should also add that I have tried doing the same synchronously with the same result.
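The original repro snippet is not captured in this thread, so the following is only a hedged reconstruction of the loop described (async receiver, batches of up to 10, completion commented out as mentioned earlier in the discussion); the connection string and queue name are placeholders:

```python
# Hypothetical reconstruction of the described repro loop, not the original code.
import asyncio
from datetime import datetime, timezone
from azure.servicebus.aio import ServiceBusClient

CONN_STR = "<service-bus-connection-string>"  # placeholder
QUEUE_NAME = "<queue-name>"                   # placeholder

async def main():
    async with ServiceBusClient.from_connection_string(CONN_STR) as client:
        receiver = client.get_queue_receiver(QUEUE_NAME)
        async with receiver:
            while True:
                msgs = await receiver.receive_messages(max_message_count=10, max_wait_time=5)
                if not msgs:
                    break
                for msg in msgs:
                    now = datetime.now(timezone.utc)
                    print(f"current time: {now}  locked_until_utc: {msg.locked_until_utc}")
                    # await receiver.complete_message(msg)  # commented out for the repro

asyncio.run(main())
```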
Expected behavior
locked_until_utc should be set when the message is received from the queue, and I would expect it to be set to the current time plus the lock duration of the queue. This does not currently seem to be the case.
Screenshots
At the start of the script, there is about a minute's gap between the Current Time and Locked Until Time. However, the same time is seen across different batches, which I would not expect to be the case, as it should be moving further into the future. Sometimes the Locked Until Time does move forward by a few ms, but not at the same rate as the retrieval time.
As the script progresses this gap gets smaller and smaller, even though messages are being returned in batches of no more than 10.
Eventually, I started getting errors due to being unable to complete messages, as the messages I just retrieved were already past their locked_until_utc times.
Additional context
I did try and replicate this with the .NET SDK for the Azure Service Bus, and while I could replicate it in part, I could also get around the issue, something I was unable to do with the Python SDK.
Edit:
Added logger output:
Also attempted using the uamqp transport; the issue still occurs with that too.
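For completeness, both things mentioned in the edit (detailed SDK logging and opting into the uamqp transport) are switched on when constructing the client. A hedged sketch with a placeholder connection string:

```python
# Sketch: enable AMQP frame-level logging and use the legacy uamqp transport
# (the latter needs the extra: pip install "azure-servicebus[uamqp]").
import logging
from azure.servicebus import ServiceBusClient

logging.basicConfig(level=logging.DEBUG)  # the SDK logs under the 'azure.servicebus' loggers

client = ServiceBusClient.from_connection_string(
    "<service-bus-connection-string>",  # placeholder
    logging_enable=True,     # emit network trace logs
    uamqp_transport=True,    # use uamqp instead of the default pyamqp implementation
)
```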