Azure / azure-iot-sdk-java

A Java SDK for connecting devices to Microsoft Azure IoT services
https://azure.github.io/azure-iot-sdk-java/
Other
198 stars 237 forks source link

[Bug Report] DeviceClient doesn't fully reconnect after temporary network disconnect (AMQP) #1467

Closed bouff closed 2 years ago

bouff commented 2 years ago

Context

Description of the issue

In version 1.34.1 of the SDK, we performed a test where we temporarily disconnect network connectivity (pulled the ethernet cable). We leave the application running for 2 minutes in this state and then restore network connectivity (ethernet cable reconnected). We expect that Twin, RPC, and events would resume normal function after network connecitivity is restored. Instead, we observed that Twin and RPC do not function. Events work do though (queued messages are being sent). In the logs, we found that exception AmqpConnectionFramingErrorException is thrown. Looking at the code, it is a retryable exception. So we'd expect connection status of IotHubConnectionStatus.DISCONNECTED_RETRYING. But instead, we encounter IotHubConnectionStatus.DISCONNECTED.

Code sample exhibiting the issue

Existing samples in this repo should exhibit the issue.

Console log of the issue

Disconnect occurs at 3:41:09. Exception AmqpConnectionFramingErrorException is thrown. Our application's connection listener is triggered with status IotHubConnectionStatus.DISCONNECTED and reason IotHubConnectionStatusChangeReason.CLIENT_CLOSE. We don't attempt to re-instantiate a DeviceClient instance, since we expect this to be used for graceful disconnect.

Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.935 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] ERROR c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Amqp transport error occurred, calling onConnectionLocalClose
Feb 03 03:41:09 somehost some.app[6102]: com.microsoft.azure.sdk.iot.device.transport.amqps.exceptions.AmqpConnectionFramingErrorException: connection aborted
Feb 03 03:41:09 somehost some.app[6102]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsExceptionTranslator.convertFromAmqpException(AmqpsExceptionTranslator.java:44)
Feb 03 03:41:09 somehost some.app[6102]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection.onTransportError(AmqpsIotHubConnection.java:581)
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:191)
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
Feb 03 03:41:09 somehost some.app[6102]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.ReactorRunner.call(ReactorRunner.java:50)
Feb 03 03:41:09 somehost some.app[6102]:         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Feb 03 03:41:09 somehost some.app[6102]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Feb 03 03:41:09 somehost some.app[6102]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Feb 03 03:41:09 somehost some.app[6102]:         at java.base/java.lang.Thread.run(Thread.java:829)
Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.935 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] DEBUG c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Amqp connection closed locally, shutting down all active sessions...
Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.935 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - onSessionClosedAsExpected callback executed, notifying transport layer
Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.936 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] DEBUG c.m.a.s.i.d.t.IotHubTransport - The device session in the multiplexed connection to the IoT Hub has been lost for device redacted-iot-id
Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.936 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] DEBUG c.m.a.s.i.d.t.IotHubTransport - Updating device redacted-iot-id status to new status DISCONNECTED with reason CLIENT_CLOSE
Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.936 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] DEBUG c.m.a.s.i.d.t.IotHubTransport - Invoking connection status callbacks with new status details
Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.936 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] INFO  o.o.d.AzureDeviceClientProvider - Connection status [DISCONNECTED]: CLIENT_CLOSE
Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.937 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] INFO  o.o.d.AzureDeviceClientProvider - Notified that client was disconnected. Will not reconnect.
Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.937 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] DEBUG c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Shutting down cbs session...
Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.937 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] TRACE c.m.a.s.i.d.t.a.AmqpsCbsSessionHandler - Closing this CBS session
Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.937 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Closing reactor since connection has closed
Feb 03 03:41:09 somehost some.app[6102]: 03:41:09.941 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] INFO  c.m.a.s.i.d.t.IotHubTransport - Handling a disconnection event
Feb 03 03:41:09 somehost some.app[6102]: com.microsoft.azure.sdk.iot.device.exceptions.TransportException: org.apache.qpid.proton.engine.HandlerException: java.lang.NullPointerException
Feb 03 03:41:09 somehost some.app[6102]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.ReactorRunner.call(ReactorRunner.java:59)
Feb 03 03:41:09 somehost some.app[6102]:         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Feb 03 03:41:09 somehost some.app[6102]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Feb 03 03:41:09 somehost some.app[6102]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Feb 03 03:41:09 somehost some.app[6102]:         at java.base/java.lang.Thread.run(Thread.java:829)
Feb 03 03:41:09 somehost some.app[6102]: Caused by: org.apache.qpid.proton.engine.HandlerException: java.lang.NullPointerException
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:112)
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:292)
Feb 03 03:41:09 somehost some.app[6102]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.ReactorRunner.call(ReactorRunner.java:50)
Feb 03 03:41:09 somehost some.app[6102]:         ... 4 common frames omitted
Feb 03 03:41:09 somehost some.app[6102]: Caused by: java.lang.NullPointerException: null
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.reactor.impl.SelectorImpl.update(SelectorImpl.java:80)
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.reactor.impl.IOHandler.onUnhandled(IOHandler.java:378)
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.engine.BaseHandler.onSelectableWritable(BaseHandler.java:95)
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:227)
Feb 03 03:41:09 somehost some.app[6102]:         at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
Feb 03 03:41:09 somehost some.app[6102]:         ... 7 common frames omitted

Two minutes later, network connectivity is restored:

Feb 03 03:41:11 somehost some.app[6102]: 03:41:11.062 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] INFO  o.o.d.AzureDeviceClientProvider - Connection status [CONNECTED]: CONNECTION_OK
Feb 03 03:41:11 somehost some.app[6102]: 03:41:11.062 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] DEBUG c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Amqp connection opened successfully
Feb 03 03:41:11 somehost some.app[6102]: 03:41:11.062 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] TRACE c.m.a.s.i.d.t.IotHubTransport - Successfully closed and re-opened the iot hub transport connection
Feb 03 03:41:11 somehost some.app[6102]: 03:41:11.062 [redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] TRACE c.m.a.s.i.d.t.IotHubTransport - OnConnectionLost was fired, but for an outdated connection. Ignoring...

And here are events sent (Twin and RPC don't function):

Feb 03 03:41:23 somehost some.app[6102]: 03:41:23.384 [other-redacted-azure-iot-sdk-IotHubSendTask] TRACE c.m.a.s.i.d.t.IotHubTransport - Dequeued a message from waiting queue to be sent ( Message details: Correlation Id [some-id-1] Message Id [some-id-2] )
Feb 03 03:41:23 somehost some.app[6102]: 03:41:23.384 [other-redacted-azure-iot-sdk-IotHubSendTask] TRACE c.m.a.s.i.d.t.IotHubTransport - Adding transport message to the inProgressPackets to wait for acknowledgement ( Message details: Correlation Id [some-id-1] Message Id [some-id-2] )
Feb 03 03:41:23 somehost some.app[6102]: 03:41:23.384 [other-redacted-azure-iot-sdk-IotHubSendTask] INFO  c.m.a.s.i.d.t.IotHubTransport - Sending message ( Message details: Correlation Id [some-id-1] Message Id [some-id-2] )
Feb 03 03:41:23 somehost some.app[6102]: 03:41:23.384 [other-redacted-azure-iot-sdk-IotHubSendTask] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Adding message to amqp message queue to be sent later ( Message details: Correlation Id [some-id-1] Message Id [some-id-2] )
Feb 03 03:41:23 somehost some.app[6102]: 03:41:23.384 [other-redacted-azure-iot-sdk-IotHubSendTask] TRACE c.m.a.s.i.d.t.IotHubTransport - Sent message ( Message details: Correlation Id [some-id-1] Message Id [some-id-2] ) to protocol level, returned status code was OK
Feb 03 03:41:23 somehost some.app[6102]: 03:41:23.419 [other-redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Sending message over amqp ( Message details: Correlation Id [some-id-1] Message Id [some-id-2] )
Feb 03 03:41:23 somehost some.app[6102]: 03:41:23.419 [other-redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] TRACE c.m.a.s.i.d.t.a.AmqpsSenderLinkHandler - Converting IoT Hub message to proton message for telemetry sender link with address /devices/redacted-iot-id/messages/events and link correlation id some-id-3. IoT Hub message correlationId some-id-1
Feb 03 03:41:23 somehost some.app[6102]: 03:41:23.420 [other-redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] TRACE c.m.a.s.i.d.t.a.AmqpsSenderLinkHandler - Sending 464 bytes over the amqp telemetry sender link with address /devices/redacted-iot-id/messages/events and link correlation id some-id-3 with link credit
Feb 03 03:41:23 somehost some.app[6102]: 03:41:23.420 [other-redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] TRACE c.m.a.s.i.d.t.a.AmqpsSenderLinkHandler - Message was sent over telemetry sender link with address /devices/redacted-iot-id/messages/events and link correlation id some-id-3 with delivery tag 1
Feb 03 03:41:23 somehost some.app[6102]: 03:41:23.420 [other-redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] TRACE c.m.a.s.i.d.t.a.AmqpsSenderLinkHandler - Current link credit on telemetry sender link with address /devices/redacted-iot-id/messages/events and link correlation id some-id-3 is 999
Feb 03 03:41:23 somehost some.app[6102]: 03:41:23.422 [other-redacted-azure-iot-sdk-ReactorRunner-ConnectionOwner] TRACE c.m.a.s.i.d.t.a.AmqpsSenderLinkHandler - Link flow received on telemetry sender link with address /devices/redacted-iot-id/messages/events and link correlation id some-id-3. Current link credit is now 999.

Thanks for your time!

timtay-microsoft commented 2 years ago

Just to clarify a couple things:

bouff commented 2 years ago

Just to clarify a couple things:

  • Are you seeing the client lose connection without it retrying to open the connection?

My apologies, I truncated the part of the logs that show retrying to open the connection. I'll provide a new, complete set of logs in a new separate comment.

  • Do you re-open the client after getting the final DISCONNECTED callback?

We re-open the client under certain circumstances.

We don't when the change reason is IotHubConnectionStatusChangeReason.CLIENT_CLOSE, since we expect this case to be reserved for graceful disconnects (as per javadoc). For this case, we take no action.

For IotHubConnectionStatusChangeReason.COMMUNICATION_ERROR or IotHubConnectionStatusChangeReason.RETRY_EXPIRED, we keep the same instance of DeviceClient and re-open (we call closeNow() followed by open()).

For all other cases, we re-initialize (we call closeNow() on the current instance and then create a new instance of DeviceClient).

  • Are you setting any custom retry policy on the client, or are you using the default?

We are using the default retry policy.

bouff commented 2 years ago

Here's a complete sequence of events - with logs at every step.

timtay-microsoft commented 2 years ago

The issue here is likely that you aren't re-subscribing to twin and methods after re-opening your client. A given deviceClient instance will hold onto your twin/method subscriptions up to the point when it is closed. Any subsequent open call should be followed by re-subscribing to twin/methods. A client will still hold onto your subscriptions after a successful automatic reconnect, though (DISCONNECTED_RETRYING -> CONNECTED)

DeviceClient client = new DeviceClient(/*omitted for brevity*/);
client.open();
client.subscribeToMethods(/*omitted for brevity*/);

// some time passes, maybe network drops followed by automatic reconnections, but methods will be received throughout

// This call will erase all subscriptions
client.close();

// after this open call, methods will not be received until you resubscribe
client.open();

Can you try re-subscribing to twin/methods after opening your client and see if you still see this behavior?

timtay-microsoft commented 2 years ago

@bouff have you tried the above suggestion?

bouff commented 2 years ago

Apologies for the delay @timtay-microsoft and thanks for your response.

Our callback that handles re-subscribing was not invoked - we purposely exclude the case of DISCONNECTED with reason CLIENT_CLOSE (since we expect this to be for graceful shutdowns, not network disconnect).

Given AmqpConnectionFramingErrorException, which is a retryable exception, we expected the SDK to reconnect (DISCONNECTED_RETRYING -> CONNECTED).

We were previously using version 1.29.1 of the SDK. In this version, temporary disconnect and reconnect works (note that it takes 20 minutes for reconnect to occur - different issue though).

Below are logs excerpts that show the behavior of the older client. When the disconnect is detected, we see DISCONNECTED_RETRYING with NO_NETWORK. The exception is ProtonIOException, which is retryable. The SDK automatically reconnects as expected and the application behaves as expected.

Timeline of events:

Logs

Disconnect detected:

Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.233 [azure-iot-sdk-ReactorRunner] ERROR c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Amqp transport error occurred, closing the amqps connection
Feb 24 22:05:06 somehost some.app[18362]: com.microsoft.azure.sdk.iot.device.transport.amqps.exceptions.ProtonIOException: Connection timed out
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsExceptionTranslator.convertFromAmqpException(AmqpsExceptionTranslator.java:84)
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection.onTransportError(AmqpsIotHubConnection.java:487)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:191)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.IotHubReactor.run(IotHubReactor.java:26)
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1118)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.lang.Thread.run(Thread.java:829)
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.238 [azure-iot-sdk-ReactorRunner] INFO  c.m.a.s.i.d.t.IotHubTransport - Handling a disconnection event
Feb 24 22:05:06 somehost some.app[18362]: com.microsoft.azure.sdk.iot.device.exceptions.TransportException: org.apache.qpid.proton.engine.HandlerException: java.lang.NullPointerException
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1122)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.lang.Thread.run(Thread.java:829)
Feb 24 22:05:06 somehost some.app[18362]: Caused by: org.apache.qpid.proton.engine.HandlerException: java.lang.NullPointerException
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:112)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:292)
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.IotHubReactor.run(IotHubReactor.java:26)
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1118)
Feb 24 22:05:06 somehost some.app[18362]:         ... 4 common frames omitted
Feb 24 22:05:06 somehost some.app[18362]: Caused by: java.lang.NullPointerException: null
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.SelectorImpl.update(SelectorImpl.java:80)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.IOHandler.onUnhandled(IOHandler.java:378)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.BaseHandler.onSelectableWritable(BaseHandler.java:95)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:227)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
Feb 24 22:05:06 somehost some.app[18362]:         ... 8 common frames omitted
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.238 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.IotHubTransport - Due to disconnection event, clearing active queues, and re-queueing them to waiting queues to be re-processed later upon reconnection
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.239 [azure-iot-sdk-IotHubSendTask] WARN  c.m.azure.sdk.iot.device.Message - The message with correlation id e8e5e2e0-d08d-4f1b-a863-54fc3b401d9b expired
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.239 [azure-iot-sdk-IotHubSendTask] DEBUG c.m.a.s.i.d.t.IotHubTransport - Invoking the callback function for sent message, IoT Hub responded to message ( Message details: Correlation Id [e8e5e2e0-d08d-4f1b-a863-54fc3b401d9b] Message Id [f6db5689-462d-4721-ba9a-b32efe87bb13] ) with status MESSAGE_EXPIRED
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.240 [azure-iot-sdk-ReactorRunner] DEBUG c.m.a.s.i.d.t.IotHubTransport - Mapping throwable to NO_NETWORK because it was a retryable exception
Feb 24 22:05:06 somehost some.app[18362]: com.microsoft.azure.sdk.iot.device.exceptions.TransportException: org.apache.qpid.proton.engine.HandlerException: java.lang.NullPointerException
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1122)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.lang.Thread.run(Thread.java:829)
Feb 24 22:05:06 somehost some.app[18362]: Caused by: org.apache.qpid.proton.engine.HandlerException: java.lang.NullPointerException
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:112)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:292)
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.IotHubReactor.run(IotHubReactor.java:26)
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1118)
Feb 24 22:05:06 somehost some.app[18362]:         ... 4 common frames omitted
Feb 24 22:05:06 somehost some.app[18362]: Caused by: java.lang.NullPointerException: null
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.SelectorImpl.update(SelectorImpl.java:80)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.IOHandler.onUnhandled(IOHandler.java:378)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.BaseHandler.onSelectableWritable(BaseHandler.java:95)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:227)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
Feb 24 22:05:06 somehost some.app[18362]:         ... 8 common frames omitted
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.241 [azure-iot-sdk-ReactorRunner] WARN  c.m.a.s.i.d.t.IotHubTransport - Updating transport status to new status DISCONNECTED_RETRYING with reason NO_NETWORK
Feb 24 22:05:06 somehost some.app[18362]: com.microsoft.azure.sdk.iot.device.exceptions.TransportException: org.apache.qpid.proton.engine.HandlerException: java.lang.NullPointerException
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1122)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Feb 24 22:05:06 somehost some.app[18362]:         at java.base/java.lang.Thread.run(Thread.java:829)
Feb 24 22:05:06 somehost some.app[18362]: Caused by: org.apache.qpid.proton.engine.HandlerException: java.lang.NullPointerException
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:112)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:292)
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.IotHubReactor.run(IotHubReactor.java:26)
Feb 24 22:05:06 somehost some.app[18362]:         at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1118)
Feb 24 22:05:06 somehost some.app[18362]:         ... 4 common frames omitted
Feb 24 22:05:06 somehost some.app[18362]: Caused by: java.lang.NullPointerException: null
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.SelectorImpl.update(SelectorImpl.java:80)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.reactor.impl.IOHandler.onUnhandled(IOHandler.java:378)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.BaseHandler.onSelectableWritable(BaseHandler.java:95)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:227)
Feb 24 22:05:06 somehost some.app[18362]:         at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
Feb 24 22:05:06 somehost some.app[18362]:         ... 8 common frames omitted
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.241 [azure-iot-sdk-ReactorRunner] DEBUG c.m.a.s.i.d.t.IotHubTransport - Invoking connection status callbacks with new status details
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.241 [azure-iot-sdk-ReactorRunner] INFO  o.o.d.AzureDeviceClientProvider - Connection status [DISCONNECTED_RETRYING]: NO_NETWORK
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.244 [azure-iot-sdk-ReactorRunner] DEBUG c.m.a.s.i.d.t.IotHubTransport - Starting reconnection logic
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.245 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.IotHubTransport - Attempting reconnect attempt 0
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.246 [azure-iot-sdk-ReactorRunner] DEBUG c.m.a.s.i.d.transport.RetryDecision - NOTE: A new instance of RetryDecision has been created with retry enabled, the client will retry after 189 milliseconds
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.247 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.IotHubTransport - Sleeping between reconnect attempts
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.436 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.IotHubTransport - Attempting to close and re-open the iot hub transport connection...
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.437 [azure-iot-sdk-ReactorRunner] DEBUG c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Shutting down amqp layer...
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.437 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - OpenAsync called for amqp connection
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.437 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Closing amqp connection
Feb 24 22:05:06 somehost some.app[18362]: 22:05:06.437 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Waiting for reactor to close...

Reconnection:

Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.438 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Shutdown of executor service has started
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.438 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Shutdown of executor service completed
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.438 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Amqp connection closed successfully
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.438 [azure-iot-sdk-ReactorRunner] DEBUG c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Opening amqp layer...
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.439 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Initializing authentication link latch count to 1
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.439 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Initializing device session latch for device redacted-iot-id
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.439 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - OpenAsnyc called for amqp connection
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.439 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Creating new executor service
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.440 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Waiting for authentication links to open...
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.441 [azure-iot-sdk-ReactorRunner] DEBUG c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Shutting down amqp layer...
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.442 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - OpenAsync called for amqp connection
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.442 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Closing amqp connection
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.442 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Waiting for reactor to close...
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.445 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsIotHubConnection - Amqp connection opened locally
Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.446 [azure-iot-sdk-ReactorRunner] TRACE c.m.a.s.i.d.t.a.AmqpsSessionHandler - Opening device session for device redacted-iot-id

...

Feb 24 22:05:26 somehost some.app[18362]: 22:05:26.798 [azure-iot-sdk-ReactorRunner] INFO  o.o.d.AzureDeviceClientProvider - Connection status [CONNECTED]: CONNECTION_OK
timtay-microsoft commented 2 years ago

Are you using these device clients within the context of a multiplexing client, or are they all operating on their own connection in these logs?

bouff commented 2 years ago

Hi @timtay-microsoft,

We're not using the multiplexing client.

timtay-microsoft commented 2 years ago

I was able to repro this same issue on multiplexed device clients but have been having a hard time repro'ing it for single device client instances. For the multiplexed device clients, #1496 fixed this bug. Can you try building from this repo's source from main and seeing if you can still repro this for non-multiplexed device clients?

bouff commented 2 years ago

Sure thing! 👍

I'll give it a try tomorrow.

timtay-microsoft commented 2 years ago

FYI, we've released the 2.0.0 clients now which include #1496 in case it makes it easier for you to try out.

timtay-microsoft commented 2 years ago

@bouff Have you been able to repro this error after upgrading to 2.0.0?

bouff commented 2 years ago

Apologies for the delay @timtay-microsoft, I'll dedicate time for this next week.

timtay-microsoft commented 2 years ago

No rush! Just let me know if you need any more help on this issue

timtay-microsoft commented 2 years ago

I'll go ahead and close this issue for now. If you are still having this issue after upgrading to the latest release, then go ahead and open a new issue on this page.

sorinpenteleiciuc commented 2 years ago

Is this fix released in 1.34.2 ?