peterbomers opened this issue 5 years ago
Need some more info.
- Do you set prefetch count on the subscription client?
No, the prefetch count is untouched. (zero by default)
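(For reference, a minimal sketch of what explicitly setting the prefetch count on the subscription client could look like, assuming a `setPrefetchCount` method is exposed on the client in this SDK version; the connection string and entity path are placeholders:)

```java
import com.microsoft.azure.servicebus.ReceiveMode;
import com.microsoft.azure.servicebus.SubscriptionClient;
import com.microsoft.azure.servicebus.primitives.ConnectionStringBuilder;

public class PrefetchSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and subscription path.
        SubscriptionClient client = new SubscriptionClient(
                new ConnectionStringBuilder("<connection-string>", "<topic>/subscriptions/<subscription>"),
                ReceiveMode.PEEKANDLOCK);

        // 0 (the default) means no prefetch; a positive value lets the client
        // pull messages ahead of the handler, which also affects lock timing.
        client.setPrefetchCount(20);
    }
}
```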
- What do you mean by 'after random number of days'? Approximately one week or one month?
Sometimes the issue occurs after 1 or 2 days but at the moment it's running stable for 8 days. Highest uptime is around 21 days.
- What happens after you get this exception? Does your subscription client continue to receive and complete messages?
The message is received by the subscription client and the business logic runs as expected (logging each step). In the end the complete method on the subscription client is called and this triggers the exception.
The max delivery count is set to 10. Each message is handled 10 times (logging shows this as well) and then moves to the dead letter queue. (I guess that's initiated by the service bus) So the client continues receiving messages but can't complete them.
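(As a side note, the dead-lettered copies can be inspected by receiving from the subscription's dead-letter sub-queue. A rough sketch with placeholder names, assuming the usual `/$deadletterqueue` path convention for Service Bus subscriptions:)

```java
import java.time.Duration;
import com.microsoft.azure.servicebus.ClientFactory;
import com.microsoft.azure.servicebus.IMessage;
import com.microsoft.azure.servicebus.IMessageReceiver;
import com.microsoft.azure.servicebus.ReceiveMode;
import com.microsoft.azure.servicebus.primitives.ConnectionStringBuilder;

public class DeadLetterPeekSketch {
    public static void main(String[] args) throws Exception {
        // "<topic>/subscriptions/<subscription>/$deadletterqueue" is the conventional
        // path of a subscription's dead-letter sub-queue; names are placeholders.
        IMessageReceiver dlqReceiver = ClientFactory.createMessageReceiverFromConnectionStringBuilder(
                new ConnectionStringBuilder("<connection-string>",
                        "<topic>/subscriptions/<subscription>/$deadletterqueue"),
                ReceiveMode.PEEKANDLOCK);

        IMessage msg;
        while ((msg = dlqReceiver.receive(Duration.ofSeconds(5))) != null) {
            System.out.println("Dead-lettered sequence number " + msg.getSequenceNumber()
                    + ", delivery count " + msg.getDeliveryCount());
            dlqReceiver.complete(msg.getLockToken());
        }
        dlqReceiver.close();
    }
}
```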
- Occasional timeouts are expected, but they should not be continuous and should never be more than 1 in a thousand. Clients receive timeouts when the service is being upgraded, or in case of a network glitch or another random issue that causes a node reboot on the service.
Ok, good to know.
- If the timeouts you get are continuous, or you see a lot of them, please capture the traces. The Java SDK supports SLF4J tracing, so you can hook up any logging framework of your choice. Traces will help me figure out what is going on.
Currently logging the exceptions in Application Insights. It shows 3 exceptions for each failed message:

| Message | Exception type | Failed method | Logger Message |
| --- | --- | --- | --- |
| Request timed out. | com.microsoft.azure.servicebus.primitives.TimeoutException | com.microsoft.azure.servicebus.primitives.CoreMessageReceiver$1.run | UpdateState request timed out. Delivery: ????A???? |
| Request timed out. | com.microsoft.azure.servicebus.primitives.TimeoutException | com.microsoft.azure.servicebus.primitives.CoreMessageReceiver$1.run | USERCALLBACK encountered exception:Request timed out. |
| Request timed out. | com.microsoft.azure.servicebus.primitives.TimeoutException | com.microsoft.azure.servicebus.primitives.CoreMessageReceiver$1.run | onMessage with message containing sequence number '56294995342236104' threw exception |
All with the same call stack:
com.microsoft.azure.servicebus.primitives.TimeoutException:
at com.microsoft.azure.servicebus.primitives.CoreMessageReceiver$1.run (CoreMessageReceiver.java:164)
at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset (FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301 (ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:624)
at java.lang.Thread.run (Thread.java:748)
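(To capture the SDK-level traces mentioned above, any SLF4J binding can be pointed at the client's packages. A minimal sketch using the `slf4j-simple` binding; the property names are specific to that binding and must be set before the first logger is created:)

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TracingSketch {
    public static void main(String[] args) {
        // Requires org.slf4j:slf4j-simple on the classpath.
        System.setProperty("org.slf4j.simpleLogger.defaultLogLevel", "info");
        // The Service Bus client logs under com.microsoft.azure.servicebus.*
        System.setProperty("org.slf4j.simpleLogger.log.com.microsoft.azure.servicebus", "trace");

        Logger log = LoggerFactory.getLogger(TracingSketch.class);
        log.info("SLF4J tracing enabled for the Service Bus client");
        // ... create the SubscriptionClient and register the handler as usual ...
    }
}
```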
Ok. Your client seems to be running fine. Can you give me the namespace, entity name, and the sequence number of the message for which completions timed out 10 consecutive times? Also at least one timestamp (UTC) of the timeout exception? I will take a look at the logs on the service and see if something is wrong.
namespace: livestock-cloud-service-bus-prod
entity: animal-end-of-day
sequence number: 21110623253404171
exception timestamp: 09/12/2018, 07:17:20
I looked at the traces on the service. One of the front end nodes was experiencing high CPU at the time, and some requests from your client to that node timed out. The message with the sequence number you provided couldn't be completed multiple times, but its lock was renewed a few times. So some operations timed out and some succeeded; unfortunately, all of the timed-out operations are 'complete' calls. After the message was deadlettered, it was received on a connection established to a different node and completed immediately. So I attribute these failures to the high CPU.

You should have noticed this on only a specific connection. By connection I mean a SubscriptionClient; not all subscription clients should have experienced the same problem. We have since rolled out a new build to production on the service, and these high CPU issues are fixed. If you experience this issue again, please do let me know or open a support ticket with Azure Service Bus. I can confirm that it is not a client side issue.
Thank you for the reply and the information! I checked the last couple of weeks (I was out of office for a while) and saw the issue occurred once again, yesterday at 04:53:17, with sequence number 3096224743940795.
FYI, the issue reappeared this morning (after working for several months).
Hey,
I've recently updated our source to use version 1.2.8, and in our production environment we're seeing strange issues. After a random number of days the subscription client starts producing TimeoutExceptions with the message "UpdateState request timed out. Delivery:?????????".
The application processes the message successfully but the final call to the service bus to ack the message fails. This results in endless processing of messages.
I've read the existing issue #272 and implemented our application in the recommended way.
Registration
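(The original snippet isn't shown in this thread; a minimal sketch of what the registration step could look like against the 1.2.x API, with placeholder names, wired to the Handler sketch below:)

```java
import java.time.Duration;
import com.microsoft.azure.servicebus.MessageHandlerOptions;
import com.microsoft.azure.servicebus.ReceiveMode;
import com.microsoft.azure.servicebus.SubscriptionClient;
import com.microsoft.azure.servicebus.primitives.ConnectionStringBuilder;

public class Registration {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and subscription path.
        SubscriptionClient client = new SubscriptionClient(
                new ConnectionStringBuilder("<connection-string>", "<topic>/subscriptions/<subscription>"),
                ReceiveMode.PEEKANDLOCK);

        // autoComplete = false: the handler completes the message itself once the
        // business logic has finished (the call that is timing out in this issue).
        MessageHandlerOptions options = new MessageHandlerOptions(1, false, Duration.ofMinutes(5));
        client.registerMessageHandler(new Handler(client), options);
    }
}
```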
Handler
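(Again a sketch rather than the original code: the handler runs the processor and then explicitly completes the message on the subscription client:)

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import com.microsoft.azure.servicebus.ExceptionPhase;
import com.microsoft.azure.servicebus.IMessage;
import com.microsoft.azure.servicebus.IMessageHandler;
import com.microsoft.azure.servicebus.SubscriptionClient;

public class Handler implements IMessageHandler {
    private final SubscriptionClient client;
    private final Processor processor = new Processor();

    public Handler(SubscriptionClient client) {
        this.client = client;
    }

    @Override
    public CompletableFuture<Void> onMessageAsync(IMessage message) {
        UUID lockToken = message.getLockToken();
        // completeAsync is the "UpdateState" call that shows up as timing out above.
        return processor.processAsync(message)
                .thenCompose(ignored -> client.completeAsync(lockToken));
    }

    @Override
    public void notifyException(Throwable exception, ExceptionPhase phase) {
        System.err.println(phase + " encountered exception: " + exception.getMessage());
    }
}
```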
Processor
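(And a stand-in for the business logic itself; the real processor is application-specific:)

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import com.microsoft.azure.servicebus.IMessage;

public class Processor {
    public CompletableFuture<Void> processAsync(IMessage message) {
        return CompletableFuture.runAsync(() -> {
            // Stand-in for the actual business logic and step-by-step logging.
            String body = new String(message.getBody(), StandardCharsets.UTF_8);
            System.out.println("Processed sequence number " + message.getSequenceNumber()
                    + " (" + body.length() + " bytes)");
        });
    }
}
```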
Any idea how to resolve this issue?