Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License

[BUG] Azure ServiceBus memory leak (due to potential use of deprecated AmqpChannelProcessor) #42717

Closed gataricd closed 2 days ago

gataricd commented 3 weeks ago

Describe the bug The number of ServiceBusReactorAmqpConnection instances keeps rising until the heap runs out of memory. When the app starts there are 44 instances, because we use 44 queues, but over time the count rises above 500 and eventually we get an OutOfMemoryError. The heap size is limited to around 500 MB. When the OutOfMemoryError happens, all the ServiceBusReactorAmqpConnection instances have sessionMap set to 0. This led us to the conclusion that the garbage collector is not cleaning up these unused connections. We have never had a similar problem before, and it happens under normal load.

Exception or Stack Trace The following exceptions might be relevant to the problem:

reactor.core.Exceptions$ErrorCallbackNotImplemented: java.lang.NullPointerException: Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null
Caused by: java.lang.NullPointerException: Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null
    at org.apache.qpid.proton.engine.impl.ConnectionImpl.session(ConnectionImpl.java:91)
    at org.apache.qpid.proton.engine.impl.ConnectionImpl.session(ConnectionImpl.java:39)

and

com.azure.core.amqp.exception.AmqpException: onSessionRemoteClose connectionId[MF_1e9f94_1730303967974], entityName[mdk-mnp-command-queue] condition[Error{condition=amqp:connection:forced, description='The connection was closed by container 'ce030eed2af746d4a84caa602d8b170a_G0' because it did not have any active links in the past 300000 milliseconds. TrackingId:ce030eed2af746d4a84caa602d8b170a_G0, SystemTracker:gateway5, Timestamp:2024-10-30T16:24:05', info=null}], errorContext[NAMESPACE: mdk-westeurope-eu-e-servicebus.servicebus.windows.net. ERROR CONTEXT: N/A, PATH: some-queue]
    at com.azure.core.amqp.implementation.ExceptionUtil.toException(ExceptionUtil.java:90)
    at com.azure.core.amqp.implementation.handler.SessionHandler.onSessionRemoteClose(SessionHandler.java:139)

These exceptions started occurring when we switched from 7.17.1 to 7.17.3.

To Reproduce There are no special steps to reproduce.

Code Snippet The following code is executed for every queue:

val processorClient: ServiceBusProcessorClient = ServiceBusClientBuilder()
            .credential(serviceBusFullyQualifiedName, DefaultAzureCredentialBuilder().build())
            .processor()
            .queueName(queueName)
            .processMessage(handler)
            .processError { context -> processError(context) }
            .maxConcurrentCalls(10)
            .disableAutoComplete()
            .maxAutoLockRenewDuration(Duration.ZERO)
            .prefetchCount(0)
            .buildProcessorClient()
processorClient.start()

Expected behavior The memory consumption of the Service Bus library shouldn't rise to the point that it causes an OutOfMemoryError.

Setup (please complete the following information): OS: Linux Library/Libraries: com.azure:azure-messaging-servicebus:7.17.3 Java version: 21 App Server/Environment: AKS Frameworks: Spring Boot

github-actions[bot] commented 3 weeks ago

@anuchandy @conniey @lmolkova

github-actions[bot] commented 3 weeks ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

anuchandy commented 3 weeks ago

Hello @gataricd, this was resolved recently; you can find more details here: https://github.com/Azure/azure-sdk-for-java/issues/41865

Note that you will still see the session disconnect/reconnect logs (which is expected), but the new version should address the NullPointerException. Please follow the steps below:

Update to 7.17.5 dependency

<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-messaging-servicebus</artifactId>
    <version>7.17.5</version>
</dependency>

Update the ServiceBusClientBuilder for "com.azure.core.amqp.cache"

When building any client (ServiceBusProcessorClient, ServiceBusReceiverClient, ServiceBusSenderClient, etc.), set the configuration "com.azure.core.amqp.cache" as shown below. Make sure this configuration is applied everywhere the application creates a new ServiceBusClientBuilder:


ServiceBusClientBuilder()
            .connectionString(queueProperties.connectionString())
            // Opt in to the new channel cache; must be set on every builder.
            .configuration(ConfigurationBuilder()
                .putProperty("com.azure.core.amqp.cache", "true")
                .build())
            .processor()
            .queueName(queueName)
            .processMessage(handler)
            .processError { context -> processError(context) }
            .maxConcurrentCalls(10)
            .disableAutoComplete()
            .maxAutoLockRenewDuration(Duration.ZERO)
            .prefetchCount(0)
            .buildProcessorClient()

Choosing this configuration is important to resolve the problem - java.lang.NullPointerException: Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null

Ensure right transitive dependencies

Make sure the transitive dependencies (azure-core-amqp, azure-core) are resolved to expected versions.

mvn dependency:tree
[INFO] ...
[INFO] +- com.azure:azure-messaging-servicebus:jar:7.17.5:compile
[INFO] |  +- com.azure:azure-core:jar:1.53.0:compile
[INFO] |  |  +- ..
[INFO] |  |  \- ...
[INFO] |  \- com.azure:azure-core-amqp:jar:2.9.10:compile
[INFO] |     +- com.microsoft.azure:qpid-proton-j-extensions:jar:1.2.5:compile
[INFO] |     \- org.apache.qpid:proton-j:jar:0.34.1:compile

Note: In an upcoming version, the need to opt in via "com.azure.core.amqp.cache" will be removed.
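
Because the opt-in has to be present on every ServiceBusClientBuilder the application creates, one way to avoid missing a builder is to construct all of them through a single helper. The following is only a minimal sketch of that idea: the helper names `amqpCacheConfiguration` and `newServiceBusClientBuilder` are hypothetical and not part of the SDK, and it assumes the application can route all builder creation through one function.

```kotlin
import com.azure.core.util.Configuration
import com.azure.core.util.ConfigurationBuilder
import com.azure.messaging.servicebus.ServiceBusClientBuilder

// Property name taken from the steps above.
private const val AMQP_CACHE_PROPERTY = "com.azure.core.amqp.cache"

// Hypothetical helper (not an SDK API): builds a Configuration carrying the opt-in.
fun amqpCacheConfiguration(): Configuration =
    ConfigurationBuilder()
        .putProperty(AMQP_CACHE_PROPERTY, "true")
        .build()

// Hypothetical helper (not an SDK API): intended to be the only place in the
// application that instantiates ServiceBusClientBuilder, so that no builder
// is created without the opt-in.
fun newServiceBusClientBuilder(): ServiceBusClientBuilder =
    ServiceBusClientBuilder()
        .configuration(amqpCacheConfiguration())
```

Callers would then chain the usual calls on the returned builder, for example `newServiceBusClientBuilder().connectionString(...).processor()...buildProcessorClient()`, instead of calling `ServiceBusClientBuilder()` directly.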

github-actions[bot] commented 3 weeks ago

Hi @gataricd. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

gataricd commented 3 weeks ago

/unresolve @anuchandy thanks for the answer, unfortunately the proposed changes didn't resolve the problem. We are still getting: java.lang.NullPointerException: Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null after updating to 7.17.5 and setting com.azure.core.amqp.cache to true. The transitive dependencies are right.

anuchandy commented 3 weeks ago

Hello @gataricd, thanks for trying it. Could you share DEBUG logs covering 20 minutes before and after this NullPointerException?

AndreasPetersen commented 2 weeks ago

Hi, we are seeing the same issue with 7.17.5.

We can provide the requested logs, but not here. Should we create a support case and upload the logs through that?

gataricd commented 2 weeks ago

We created a support ticket, since we can't upload the logs here.

AndreasPetersen commented 1 week ago

> Hi, we are seeing the same issue with 7.17.5.
>
> We can provide the requested logs, but not here. Should we create a support case and upload the logs through that?

My apologies, we didn't read @anuchandy's description thoroughly enough initially. After adding

> Update the ServiceBusClientBuilder for "com.azure.core.amqp.cache"

as described, we are no longer seeing the issue.

anuchandy commented 1 week ago

Hello @AndreasPetersen, glad to hear that setting the configuration resolved the issue. I apologize for not responding earlier; I missed this notification. (Yes, if the configuration is not set, the deprecated AmqpChannelProcessor type is used, which can cause the NPE. Setting the configuration switches to RequestResponseChannelCache, the new type that replaces AmqpChannelProcessor.)

anuchandy commented 1 week ago

> We created a support ticket, since we can't upload the logs here.

Hello @gataricd, the case has not been routed to our team yet. I suspect the configuration "com.azure.core.amqp.cache" did not take effect in your app setup. Once I have received the logs, I will review them and follow up on the case. Do you mind if I close this GitHub issue, now that there is a ticket?

anuchandy commented 2 days ago

Closing, as there is a support case for this. Currently, it appears that the opt-in may not be configured in some builders, or is not recognized by the SDK at runtime, so the library continues to use the deprecated AmqpChannelProcessor. If the case conclusion differs from this, I will update here.