Azure / azure-service-bus-java

☁️ Java client library for Azure Service Bus
https://azure.microsoft.com/services/service-bus
MIT License

Too many ReactorThread instances in the thread dump. #365

Open NairRahulMohanan opened 5 years ago

NairRahulMohanan commented 5 years ago

We have two subscription clients receiving messages at an interval of 30 seconds. At times we stop receiving messages for some reason, and then we have to stop both subscription clients and reconnect them to continue receiving messages. We can see that with each successive stop and reconnect, the total thread count of the application increases, and so does the CPU usage. No exceptions are thrown by the API during this period. We are using servicebus version 1.2.13.

On generating a thread dump we could see that, out of the 594 threads running in the application at that instant, 524 belonged to instances of ReactorThread. An example stack trace is as follows:

```
"ReactorThread645d2ac6-eaeb-4599-aa24-7cff615082dc" #33560 prio=5 os_prio=64 tid=0x00000001023bc000 nid=0x832c runnable [0xffffffff118fe000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.DevPollArrayWrapper.poll0(Native Method)
        at sun.nio.ch.DevPollArrayWrapper.poll(DevPollArrayWrapper.java:223)
        at sun.nio.ch.DevPollSelectorImpl.doSelect(DevPollSelectorImpl.java:98)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
```

All of these instances are also in thread state runnable.

It seems that subscriptionClient.close() doesn't actually close the ReactorThread instance. We checked the heap space utilization and it follows a normal sawtooth pattern without ever crossing 200 MB, so there is no memory leak.

Could you please confirm whether subscriptionClient.close() actually closes the reactor instance too? If not, how can we do it? This high CPU usage has been a priority issue for our clients for quite some time, so a quick reply would be much appreciated.

NairRahulMohanan commented 5 years ago

Please also note that we are passing an ExecutorService while registering a message handler on the subscriptionClient, and then calling executorService.shutdown() after subscriptionClient.close().
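
For reference, this is roughly the shape of that registration and shutdown sequence. It is only a sketch: the connection string, entity path, and the 60-second receive window are placeholders, and it assumes an SDK version that exposes the ExecutorService overload of registerMessageHandler.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.microsoft.azure.servicebus.ExceptionPhase;
import com.microsoft.azure.servicebus.IMessage;
import com.microsoft.azure.servicebus.IMessageHandler;
import com.microsoft.azure.servicebus.MessageHandlerOptions;
import com.microsoft.azure.servicebus.ReceiveMode;
import com.microsoft.azure.servicebus.SubscriptionClient;
import com.microsoft.azure.servicebus.primitives.ConnectionStringBuilder;

public class SubscriptionLifecycleSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- replace with the real namespace and entity path.
        String connectionString = "<namespace-connection-string>";
        String subscriptionPath = "<topic>/subscriptions/<subscription>";

        ExecutorService executor = Executors.newSingleThreadExecutor();
        SubscriptionClient client = new SubscriptionClient(
                new ConnectionStringBuilder(connectionString, subscriptionPath),
                ReceiveMode.PEEKLOCK);

        // Handler callbacks are dispatched on the executor passed in here.
        client.registerMessageHandler(new IMessageHandler() {
            @Override
            public CompletableFuture<Void> onMessageAsync(IMessage message) {
                System.out.println("Received: " + message.getMessageId());
                return CompletableFuture.completedFuture(null);
            }

            @Override
            public void notifyException(Throwable exception, ExceptionPhase phase) {
                System.err.println("Pump error in phase " + phase + ": " + exception);
            }
        }, new MessageHandlerOptions(), executor);

        Thread.sleep(60_000); // receive for a while (placeholder)

        // Shutdown order described in the comment above:
        // close the client first, then the executor handed to registerMessageHandler.
        client.close();
        executor.shutdown();
    }
}
```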

yvgopal commented 5 years ago

If you are creating a SubscriptionClient by calling its constructor, then it internally creates a MessagingFactory for that SubscriptionClient. 1 SubscriptionClient = 1 MessagingFactory = 1 Reactor.

When you close the SubscriptionClient, it will close the MessagingFactory. When the MessagingFactory is closed, the reactor gracefully exits.

The answer to your question: YES, closing the subscription client closes the reactor too. Is your subscriptionClient.close() call succeeding without any exceptions? If yes, could you share the thread dump? Also, how many new threads do you see every time you close and recreate a subscription client? Does the thread count fall off slowly? And how high does the CPU usage get? Any idea which threads are consuming the most CPU cycles?
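
One way to make the outcome of the close visible is to wait on closeAsync() explicitly rather than calling the blocking close(). A rough diagnostic sketch (the 60-second wait is an arbitrary choice, and it assumes the closeAsync() method from the client's ICloseable surface):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.microsoft.azure.servicebus.SubscriptionClient;

public final class CloseDiagnostics {
    private CloseDiagnostics() {}

    /**
     * Waits on closeAsync() so that a hung or failed close is reported explicitly
     * instead of being swallowed. The 60-second wait is an arbitrary choice.
     */
    public static void closeAndReport(SubscriptionClient client) throws InterruptedException {
        try {
            client.closeAsync().get(60, TimeUnit.SECONDS);
            System.out.println("Subscription client closed cleanly: " + client);
        } catch (TimeoutException e) {
            System.err.println("Close did not finish within 60s; the reactor thread may still be running: " + client);
        } catch (ExecutionException e) {
            System.err.println("Close failed for " + client + ": " + e.getCause());
        }
    }
}
```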

NairRahulMohanan commented 5 years ago

@yvgopal Yes, we do get a timeout exception while calling close. When this happens, we sleep for a predefined interval and try closing again.

```java
boolean tryAgain = true;
if (client != null) {
    SubscriptionClient clientDummy = client;
    while (tryAgain) {
        try {
            clientDummy.close();
            tryAgain = false;
        } catch (ServiceBusException e) {
            logger.dbg("Servicebus exception while closing region " + azureRegion + " with instance "
                    + clientDummy.toString() + ". Will retry closing after "
                    + propertyReader.getConnectionRetryInterval() / 1000 + " seconds", Logger.WARNING);
            logger.stackDump(e);
            try {
                Thread.sleep(propertyReader.getConnectionRetryInterval());
            } catch (InterruptedException e1) {
                logger.dbg("Thread interrupted!", Logger.WARNING);
                logger.stackDump(e1);
            }
        }
    }
    logger.dbg("Servicebus instance " + clientDummy.toString() + " closed successfully", Logger.INFO);
    clientDummy = null;
}
```

We have a .dbg file that logs the application activity and a separate log file that captures the console output. Since the servicebus Java API seems to output a lot of logs, the client has set the console logging level to ERROR, while the .dbg file is set to write all logging levels. We didn't find any entries for the corresponding times in the console logs.

Here are some excerpts from the .dbg log

```
27/05 13:19:37.664 New subscription client with instance com.microsoft.azure.servicebus.SubscriptionClient@3c5dbed7 created
27/05 13:50:03.568 Servicebus exception while closing region 1 with instance com.microsoft.azure.servicebus.SubscriptionClient@3c5dbed7. Will retry closing after 60 seconds
27/05 13:50:03.569 Close operation on Receive Link(Receiver_193ba0_71d16080c0d94927bc3ddff4a35ae55f_G0) timed out at 2019-05-27T13:50:03.548Z[GMT], errorContext[NS: halas-deice-servicebus-preprod-we.servicebus.windows.net, PATH: halasdeicepreprodtopic/subscriptions/243a7d00-dd3a-4654-8001-88156ee306b4, REFERENCE_ID: Receiver_193ba0_71d16080c0d94927bc3ddff4a35ae55f_G0, PREFETCH_COUNT: 0, LINK_CREDIT: 1, PREFETCH_Q_LEN: 0]
27/05 13:50:03.570 at com.microsoft.azure.servicebus.primitives.CoreMessageReceiver$9.run (CoreMessageReceiver.java:933)
27/05 13:50:03.570 at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511)
27/05 13:50:03.570 at java.util.concurrent.FutureTask.run (FutureTask.java:266)
27/05 13:50:03.571 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201 (ScheduledThreadPoolExecutor.java:180)
27/05 13:50:03.571 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:293)
27/05 13:50:03.571 at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
27/05 13:50:03.571 at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
27/05 13:50:03.572 at java.lang.Thread.run (Thread.java:745)
27/05 13:51:03.575 Servicebus instance com.microsoft.azure.servicebus.SubscriptionClient@3c5dbed7 closed successfully
```

The CPU usage starts at around <1% on application start and gradually increases; over the course of two days it is now at around 6%. The thread count starts at 69 and is now at 331. The network issues I mentioned earlier tend to creep in roughly every half hour, so we reconnect the two subscriptionClients about twice every hour, and the number of threads tends to increase by 4 every hour. We will get a new heap dump to you as soon as possible; we do not have permission to generate one ourselves, so it can only be done by the client, as the application runs in their environment.

NairRahulMohanan commented 5 years ago

Prod_thread_dump_20190530.log

Please find the thread dump attached, as requested.

yvgopal commented 5 years ago

OK. The thread dump shows 250 different reactor threads. That means the ReactorThread is not exiting gracefully. The root cause is that the subscription client close is failing. When you close a subscription client, we close the receive link first, then the other links, and then the reactor, in that order. If one step fails, the later steps are not executed. That's why the reactor thread keeps running.

Calling close again will not have any effect, given the way we coded it.
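
To illustrate the ordering described above, here is a purely hypothetical simplification of such a chained close; receiveLink, otherLinks, and reactor are illustrative stand-ins, not the SDK's actual fields:

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical simplification only -- receiveLink, otherLinks and reactor are
// illustrative stand-ins, not the SDK's real fields or types.
class CloseChainSketch {
    interface AsyncCloseable {
        CompletableFuture<Void> closeAsync();
    }

    private final AsyncCloseable receiveLink;
    private final AsyncCloseable otherLinks;
    private final AsyncCloseable reactor;

    CloseChainSketch(AsyncCloseable receiveLink, AsyncCloseable otherLinks, AsyncCloseable reactor) {
        this.receiveLink = receiveLink;
        this.otherLinks = otherLinks;
        this.reactor = reactor;
    }

    CompletableFuture<Void> closeAsync() {
        return receiveLink.closeAsync()                     // 1. close the receive link first
                .thenCompose(v -> otherLinks.closeAsync())  // 2. runs only if step 1 completed normally
                .thenCompose(v -> reactor.closeAsync());    // 3. never reached when an earlier stage fails
                                                            //    or times out, so the reactor keeps running
    }
}
```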

There is one action item for you and one low-priority item for me. You should investigate why subscriptionclient close consistently times out for you; maybe it is related to the network issue you mentioned. That should be resolved. If it is resolved, your issue of growing thread count and CPU usage will be gone.

For me, I created ticket #366 to track it. It is low priority because it won't address the problem as long as the underlying root cause preventing a link or connection close remains. You should resolve the network issue for your customer without waiting for my fix, as my fix will not really address such issues if the network problem persists.

NairRahulMohanan commented 5 years ago

Thanks @yvgopal. We have let our client know and asked them to check the network issues. Still, could you please let us know what timeline to expect for a fix to be in place?

ThanseehAslam commented 3 years ago

@yvgopal do you have any update on this? We are still seeing that the ReactorThread is not closing, even though we call close multiple times.

Mick-Auger commented 3 years ago

Just to add to this: the issue occurs when the application loses its connection to the Azure endpoint. Surely the thread should handle connection outages cleanly? Is there any sort of workaround that we could consider to get around this? Any alternative way of coding the connection so that the thread closes cleanly when a network disconnection occurs?