We ran another test with prefetchCount = 0 and maxConcurrentCalls = 8 (the number of cores on the server). We got fewer of these at first, but after about 30 minutes we started to see them regularly in the logs again:
com.azure.messaging.servicebus.ServiceBusException: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue. Reference:cd18804a-8364-4c11-a321-48aba7c7c088, TrackingId:8871df5a000200060000094860763f60_G1_B45, SystemTracker:mq4az-gocd-pipeline:Topic:mq4az_perftest_standard|perm_default, Timestamp:2021-04-14T02:55:11, errorContext[NAMESPACE: mq4az-gocd-pipeline.servicebus.windows.net, PATH: mq4az_perftest_standard/subscriptions/perm_default, REFERENCE_ID: mq4az_perftest_standard/subscriptions/perm_default_86bf05_1618362207561, LINK_CREDIT: 0]
at com.azure.messaging.servicebus.ServiceBusReceiverAsyncClient.lambda$updateDisposition$43(ServiceBusReceiverAsyncClient.java:1155)
at reactor.core.publisher.Mono.lambda$onErrorMap$30(Mono.java:3384)
at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:94)
at reactor.core.publisher.Operators$MonoSubscriber.onError(Operators.java:1862)
Considering the prefetch count of 0, something seems wrong here.
It looks as if the default lock duration applied by a processor is insufficient and there is no way to override it when building the processor client.
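For context: the message lock duration itself is a property of the topic subscription (its lock duration setting) rather than a client builder option, and a client can only extend it by renewing the lock. Below is a minimal illustrative sketch of explicit renewal with the lower-level synchronous receiver client, assuming the 7.x API; the entity names are taken from the error context above and the connection string is a placeholder, not the reporter's actual code.

import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusReceivedMessage;
import com.azure.messaging.servicebus.ServiceBusReceiverClient;
import com.azure.messaging.servicebus.models.ServiceBusReceiveMode;

public final class ManualLockRenewalSketch {
    public static void drainOne(String connectionString) {
        ServiceBusReceiverClient receiver = new ServiceBusClientBuilder()
            .connectionString(connectionString)            // placeholder
            .receiver()
            .topicName("mq4az_perftest_standard")          // entity names from the error context above
            .subscriptionName("perm_default")
            .receiveMode(ServiceBusReceiveMode.PEEK_LOCK)
            .buildClient();

        for (ServiceBusReceivedMessage message : receiver.receiveMessages(1)) {
            // Explicitly extends the lock by the subscription's configured lock duration.
            receiver.renewMessageLock(message);
            receiver.complete(message);
        }
        receiver.close();
    }
}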
After about 45 minutes of running with prefetchCount = 0 and maxConcurrentCalls = 8, our logs are now flooded with the same lock exception. So it took longer than when we were running with prefetchCount / maxConcurrentCalls of 160, but it ends up in the same state: logs overflowing with the same exception over and over.
Thanks for reporting this @jfurmankiewiczpros. @YijunXieMS could you please investigate?
/cc @hemanttanwar
Hi all. I managed to capture the log of a full run that reproduces this. Attaching the log.
Here's how the test was run:
It was all slowly chugging along until around 34K messages.
Then it suddenly got an invalid lock exception on line 183053 of the attached log, and after that the exception just appears over and over again, flooding the logs.
Hope this helps. Right now we are holding off on adopting the new SDK, as we have not been able to complete a single successful test run with it, so it's a blocker for us.
Also, I forgot to mention that we set the following options on the processor, taking advantage of the maxAutoLockRenewDuration API added in 7.3.0-beta.1:
.maxAutoLockRenewDuration(Duration.ofMinutes(5))
.receiveMode(ServiceBusReceiveMode.PEEK_LOCK)
.prefetchCount(0)
.maxConcurrentCalls(8)
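For reference, a minimal sketch of where these options sit on the processor builder, assuming the 7.3.0-beta.1 API mentioned above; everything other than the four options quoted is an illustrative placeholder, not the reporter's actual code.

import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusProcessorClient;
import com.azure.messaging.servicebus.models.ServiceBusReceiveMode;

import java.time.Duration;

public final class ProcessorConfigSketch {
    public static ServiceBusProcessorClient build(String connectionString) {
        return new ServiceBusClientBuilder()
            .connectionString(connectionString)                // placeholder
            .processor()
            .topicName("mq4az_perftest_standard")              // entity names from the error context above
            .subscriptionName("perm_default")
            .maxAutoLockRenewDuration(Duration.ofMinutes(5))   // added to the processor builder in 7.3.0-beta.1
            .receiveMode(ServiceBusReceiveMode.PEEK_LOCK)
            .prefetchCount(0)
            .maxConcurrentCalls(8)
            .processMessage(context -> context.complete())     // placeholder no-op handler
            .processError(errorContext -> System.err.println(errorContext.getException()))
            .buildProcessorClient();
    }
}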
Hi, any updates on this? It is a critical showstopper for us; we cannot move ahead with adopting the SDK.
Hi @jfurmankiewiczpros, I'm working on this and targeting a release by 05/14. Does this meet your timeline?
I think we can live with that. If you could share some early alpha builds that we could test with sooner, that would be greatly appreciated. Thank you.
@jfurmankiewiczpros, appreciate your willingness to help test the alpha build. Will keep you updated.
@jfurmankiewiczpros It's possible that the message stays in the cache for too long. There is no auto lock renewal yet while a message is in the cache. With prefetch set to 0 this shouldn't happen in theory. I will do a long running test. Meanwhile, could you re-run the test with "debug"-level logs and timestamps on the log entries? That will give us more information.
Let me see. This test suite is a simple Java class with main(), so logging is just basic console output. I will try to enable all of that and re-run the test.
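As a sketch of one way to get timestamped debug output from a plain main()-class test like this, assuming the slf4j-simple binding is on the classpath (a logback or log4j2 setup would use its own configuration file instead); the property values are illustrative:

public final class DebugLoggingSketch {
    public static void main(String[] args) {
        // slf4j-simple reads these system properties when it creates its first logger,
        // so they must be set before any SDK class logs anything.
        System.setProperty("org.slf4j.simpleLogger.defaultLogLevel", "debug");
        System.setProperty("org.slf4j.simpleLogger.showDateTime", "true");
        System.setProperty("org.slf4j.simpleLogger.dateTimeFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS");

        // ... build the processor client and run the long-running test here ...
    }
}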
@jfurmankiewiczpros please forget about the debug log. I already have it.
@jfurmankiewiczpros We released azure-messaging-servicebus 7.2.1, and my long running test no longer has this problem. .maxAutoLockRenewDuration(Duration.ofMinutes(5)) isn't available in 7.2.1 because it is still only in the beta version, but 7.2.1 has a default auto lock renewal of 5 minutes, so I suggest you use 7.2.1.
Could you try it? Let me know if there are any more problems.
Absolutely, let me try that tomorrow and let you know.
@jfurmankiewiczpros have you tried it yet?
Yes, THAT issue seems to have gone away, but I am still not able to successfully run my long-running tests. Fetching messages one at a time is very slow, and if I increase prefetchCount I start getting some management node errors after a few minutes and everything seems to grind to a halt. I will probably open a separate issue for that once we can isolate it more.
But I am OK with closing this particular issue as solved; it seems to have gone away with the latest SDK.
Thanks for the update. I'll look into that.
Describe the bug
We ported to the new SDK and are running some long-running perf tests to see how it is doing. Our logs are flooded with constant exceptions like the one quoted above when we call complete() on the message context.
We create the processor in a standard way:
Notice there is no option in the configuration to specify the message lock duration. We presume it uses some built-in default that we can't override.
Our actual subscriber processing logic is basically a no-op: it verifies that we got the message and logs a statement, nothing else. So the message processing is essentially instant.
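A sketch of what such a handler might look like (illustrative only, not the reporter's actual code); the complete() call is where the invalid-lock exception above is thrown once the lock has expired:

import com.azure.messaging.servicebus.ServiceBusReceivedMessageContext;

import java.util.function.Consumer;

public final class NoOpHandlerSketch {
    // Passed to the builder via .processMessage(...); it only logs and completes.
    public static final Consumer<ServiceBusReceivedMessageContext> PROCESS_MESSAGE = context -> {
        System.out.println("Received message " + context.getMessage().getMessageId());
        context.complete(); // fails with "The lock supplied is invalid" if the lock expired
    };
}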
And yet our logs are flooded with thousands of these lock-expired exceptions.
Setup (please complete the following information):