Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License
2.33k stars 1.97k forks source link

[BUG] "Operator called default onErrorDropped" message caused by NullPointerException in freed connection #42429

Open haraldk opened 2 hours ago

haraldk commented 2 hours ago

Describe the bug

Occasionally (several times a day) we get two errors in our application's log, first "{"az.sdk.message":"Retry attempts exhausted or exception was not retriable.","exception":"Cannot invoke \"java.util.List.add(Object)\" because \"this._sessions\" is null", ..." followed by one or more "Operator called default onErrorDropped" messages.

I don't think messages are actually lost (not processed) due to this, but the message is still very annoying, and as it happens often it creates noise in the logs. It does happen in all our environments.

I think this might be related to #41637 but I was surprised to see that upgrading to the latest servicebus release (7.17.5) did not remove the issue.

Exception or Stack Trace

We get two error messages in our log, only the second contains a stack trace. But I believe they are related:

our-application: ERROR:  {"az.sdk.message":"Retry attempts exhausted or exception was not retriable.","exception":"Cannot invoke \"java.util.List.add(Object)\" because \"this._sessions\" is null","connectionId":"MF_5f89df_1729161349513","entityPath":"$cbs","tryCount":1}
our-application: ERROR:  Operator called default onErrorDropped
reactor.core.Exceptions$ErrorCallbackNotImplemented: java.lang.NullPointerException: Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null
Caused by: java.lang.NullPointerException: Cannot invoke "java.util.List.add(Object)" because "this._sessions" is null
    at org.apache.qpid.proton.engine.impl.ConnectionImpl.session(ConnectionImpl.java:91)
    at org.apache.qpid.proton.engine.impl.ConnectionImpl.session(ConnectionImpl.java:39)
    at com.azure.core.amqp.implementation.ReactorConnection.lambda$createSession$15(ReactorConnection.java:342)
    at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(Unknown Source)
    at com.azure.core.amqp.implementation.ReactorConnection.lambda$createSession$16(ReactorConnection.java:339)
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:113)
    at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onNext(MonoPeekTerminal.java:180)
    at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onNext(FluxHide.java:137)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.request(MonoIgnoreThen.java:164)
    at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.request(FluxHide.java:152)
    at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.request(MonoPeekTerminal.java:139)
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
    at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
    at reactor.core.publisher.FluxPeekFuseable$PeekFuseableSubscriber.request(FluxPeekFuseable.java:144)
    at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.request(Operators.java:2331)
    at com.azure.core.amqp.implementation.AmqpChannelProcessor.requestUpstream(AmqpChannelProcessor.java:336)
    at com.azure.core.amqp.implementation.AmqpChannelProcessor.lambda$onError$4(AmqpChannelProcessor.java:230)
    at reactor.core.publisher.LambdaMonoSubscriber.onNext(LambdaMonoSubscriber.java:171)
    at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:270)
    at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:285)
    at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
    at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

To Reproduce

I don't have exact reproducer code at this time, as the code is quite complicated. We use a standard ServiceBusProcessorClient created from a ServiceBusClientBuilder like this:

new ServiceBusClientBuilder()
                     .connectionString(queueProperties.connectionString())
                     .processor()
                     .queueName(queueProperties.name())
                     .disableAutoComplete()
                     .maxAutoLockRenewDuration(Duration.ofMinutes(10))
                     .processMessage(this::processMessage)
                     .processError(this::processError)
                     .buildProcessorClient()

However, none of our code is actually involved in the stack trace above, so I don't think the contents of processMessage or processError matters.

From what I can see, the only way the _sessions field in org.apache.qpid.proton.engine.impl.ConnectionImpl can be null, is someone first invoking doFree(). Ie. it seems that someone/something tries to create a new session from an already freed connection. This might be due to a race condition, or a simpler logic bug somewhere.

Let me know if you need more information to reproduce.

Code Snippet

No code at the moment.

Expected behavior

I would expect the connection to be freed and a new session created in a new, fresh session, without any exceptions or error messages in the log.

Setup (please complete the following information):

Information Checklist Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report

github-actions[bot] commented 2 hours ago

@anuchandy @conniey @lmolkova

github-actions[bot] commented 2 hours ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.