Azure / azure-iot-sdk-python

A Python SDK for connecting devices to Microsoft Azure IoT services
MIT License

Endless connect - disconnect loop due to ConnectionDroppedError('Paho returned rc==1') #1168

Closed: gdebezpredel closed this issue 5 months ago

gdebezpredel commented 5 months ago

Context

Description of the issue

One of my IoT devices encountered a strange issue where it got disconnected from IoT Hub and then was unable to stay connected. It would reconnect immediately, only for the connection to be dropped again, and this happened hundreds of times a minute. Because each reconnect triggers a twin read, the daily message quota was quickly exhausted.

The device disconnected because of MQTTTransportStage: _on_mqtt_disconnect called: ConnectionDroppedError('Paho returned rc==1'). [pipeline_stages_mqtt:_on_mqtt_disconnected]. The issue stopped after a device reboot and I haven't been able to reproduce it. rc==1 seems to indicate that something went wrong on the IoT Hub side, but it isn't clear what exactly. I was able to trigger rc==1 by manually disabling the device connection from the IoT Hub portal, but in my testing the subsequent reconnect attempts failed right away with a ConnectionRefused error.

What could cause the connection drops? Is there a setting I can use to make a device wait a certain time before attempting to reconnect (something like the sketch below)? Is there a setting on IoT Hub to prevent this unintended, self-inflicted DDoS? Any other recommendations?
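To make the question concrete, this is roughly what I'm hoping exists. It is only a sketch: I'm assuming create_from_connection_string accepts the connection_retry and connection_retry_interval keyword arguments, and the 120-second interval is just an example value.

from azure.iot.device import IoTHubDeviceClient

# Assumption: the v2 SDK exposes connection_retry / connection_retry_interval
# as client options; the placeholder connection string and the 120 s interval
# are illustrative only.
client = IoTHubDeviceClient.create_from_connection_string(
    "<device connection string>",
    connection_retry=True,          # keep the SDK's automatic reconnect
    connection_retry_interval=120,  # but wait 120 seconds between attempts
)
client.connect()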

I saw that there's a bug fix in 2.12.0, 'Prevented infinite twin request loop', but that seems to be a different issue.

Console log of the issue

2023-12-13 23:39:06 +0100 INFO : Successfully sent message to Hub. [sync_clients:send_message]
2023-12-13 23:42:40 +0100 INFO : disconnected with result code: 1. [mqtt_transport:on_disconnect]
2023-12-13 23:42:40 +0100 INFO : Forcing paho disconnect to prevent it from automatically reconnecting. [mqtt_transport:_force_transport_disconnect_and_cleanup]
2023-12-13 23:42:40 +0100 INFO : MQTTTransportStage: _on_mqtt_disconnect called: ConnectionDroppedError('Paho returned rc==1'). [pipeline_stages_mqtt:_on_mqtt_disconnected]
2023-12-13 23:42:40 +0100 INFO : MQTTTransportStage: disconnection was unexpected. [pipeline_stages_mqtt:_on_mqtt_disconnected]
2023-12-13 23:42:40 +0100 INFO : Unexpected disconnection. Safe to ignore since other stages will reconnect.. [handle_exceptions:swallow_unraised_exception]
2023-12-13 23:42:40 +0100 INFO : azure.iot.device.common.transport_exceptions.ConnectionDroppedError: ConnectionDroppedError('Paho returned rc==1')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/azure/iot/device/common/handle_exceptions.py", line 43, in swallow_unraised_exception
    raise e
azure.iot.device.common.transport_exceptions.ConnectionDroppedError: ConnectionDroppedError(None) caused by ConnectionDroppedError('Paho returned rc==1') . [handle_exceptions:swallow_unraised_exception]
2023-12-13 23:42:40 +0100 INFO : Connection State - Disconnected. [abstract_clients:_on_disconnected]
2023-12-13 23:42:40 +0100 INFO : Cleared all pending method requests due to disconnect. [abstract_clients:_on_disconnected]
2023-12-13 23:42:40 +0100 INFO : Connect using port 443 (websockets). [mqtt_transport:connect]
2023-12-13 23:42:40 +0100 INFO : connected with result code: 0. [mqtt_transport:on_connect]
2023-12-13 23:42:40 +0100 INFO : _on_mqtt_connected called. [pipeline_stages_mqtt:_on_mqtt_connected]
2023-12-13 23:42:40 +0100 INFO : {}: Reconnected. Getting twin. [pipeline_stages_iothub:_handle_pipeline_event]
2023-12-13 23:42:40 +0100 INFO : EnsureDesiredPropertiesStage: sending twin GET to ensure freshness. [pipeline_stages_iothub:_ensure_get_op]
2023-12-13 23:42:40 +0100 INFO : ReconnectStage(ConnectOperation): State changes LOGICALLY_CONNECTED->LOGICALLY_CONNECTED. Adding to wait list and sending new connect op down. [pipeline_stages_base:_run_op]
2023-12-13 23:42:40 +0100 INFO : CoordinateRequestAndResponseStage: ConnectedEvent: re-publishing request 3cc646d3-1e4e-40bb-aa5f-c33b5347f70c for GET twin . [pipeline_stages_base:_handle_pipeline_event]
2023-12-13 23:42:40 +0100 INFO : Connection State - Connected. [abstract_clients:_on_connected]
2023-12-13 23:42:40 +0100 INFO : ConnectionLockStage(ConnectOperation): Transport is already connected. Completing.. [pipeline_stages_base:_run_op]
2023-12-13 23:42:40 +0100 INFO : publishing on $iothub/twin/GET/?$rid=3cc646d3-1e4e-40bb-aa5f-c33b5347f70c. [mqtt_transport:publish]
2023-12-13 23:42:40 +0100 INFO : publishing on $iothub/twin/GET/?$rid=3cc646d3-1e4e-40bb-aa5f-c33b5347f70c. [mqtt_transport:publish]
2023-12-13 23:42:40 +0100 INFO : message received on $iothub/twin/res/200/?$rid=3cc646d3-1e4e-40bb-aa5f-c33b5347f70c. [mqtt_transport:on_message]
2023-12-13 23:42:40 +0100 INFO : payload published for 517. [mqtt_transport:on_publish]
2023-12-13 23:42:40 +0100 INFO : message received on $iothub/twin/res/200/?$rid=3cc646d3-1e4e-40bb-aa5f-c33b5347f70c. [mqtt_transport:on_message]
2023-12-13 23:42:40 +0100 INFO : CoordinateRequestAndResponseStage(ResponseEvent): request_id 3cc646d3-1e4e-40bb-aa5f-c33b5347f70c not found in pending list. Nothing to do. Dropping. [pipeline_stages_base:_handle_pipeline_event]
2023-12-13 23:42:40 +0100 INFO : payload published for 518. [mqtt_transport:on_publish]
2023-12-13 23:42:40 +0100 INFO : disconnected with result code: 1. [mqtt_transport:on_disconnect]
2023-12-13 23:42:40 +0100 INFO : Forcing paho disconnect to prevent it from automatically reconnecting. [mqtt_transport:_force_transport_disconnect_and_cleanup]
2023-12-13 23:42:40 +0100 INFO : MQTTTransportStage: _on_mqtt_disconnect called: ConnectionDroppedError('Paho returned rc==1'). [pipeline_stages_mqtt:_on_mqtt_disconnected]
2023-12-13 23:42:40 +0100 INFO : MQTTTransportStage: disconnection was unexpected. [pipeline_stages_mqtt:_on_mqtt_disconnected]
2023-12-13 23:42:40 +0100 INFO : Unexpected disconnection. Safe to ignore since other stages will reconnect.. [handle_exceptions:swallow_unraised_exception]
2023-12-13 23:42:40 +0100 INFO : azure.iot.device.common.transport_exceptions.ConnectionDroppedError: ConnectionDroppedError('Paho returned rc==1')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../lib/python3.9/site-packages/azure/iot/device/common/handle_exceptions.py", line 43, in swallow_unraised_exception
    raise e
azure.iot.device.common.transport_exceptions.ConnectionDroppedError: ConnectionDroppedError(None) caused by ConnectionDroppedError('Paho returned rc==1') . [handle_exceptions:swallow_unraised_exception]

... and this repeats indefinitely.

cartertinney commented 5 months ago

@gdebezpredel

Can you try updating to the latest version of azure.iot.device? This probably won't solve the problem by itself, but it will change the behavior in a way that is more useful for getting to the root of the problem. The reconnection logic has been heavily reworked since the version you're using to be more judicious about reconnects when things are going wrong, and to better report what is occurring.

If you could update to 2.12 and then post some logs, we can look into it further.

To get logs, please add the following lines to your application:

import logging
logging.basicConfig(level=logging.ERROR)
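If the goal is to capture the INFO-level pipeline messages shown above, a more verbose setup along these lines may be needed. This is only a sketch; the DEBUG level, file name, and logger name are assumptions, not the maintainer's exact request.

import logging

# Sketch: capture verbose SDK output to a file so a reconnect storm can be
# inspected after the fact. DEBUG level and the file name are assumptions.
logging.basicConfig(
    filename="azure_iot_device.log",
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger("azure.iot.device").setLevel(logging.DEBUG)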
gdebezpredel commented 5 months ago

Thank you @cartertinney. Yes, I'm already in the process of updating, but I doubt I'll be able to reproduce the issue in order to collect new logs; it just happened randomly.

I guess I want to know if there's any way to prevent this kind of problem in the future. Maybe there's a way to add a wait before reconnecting? Also, can you point me to the reconnection logic code?

cartertinney commented 5 months ago

@gdebezpredel

What I mean is that the infinite reconnect loop that eats up your quota will probably be fixed by the latest package. The disconnection itself is something else, likely related to the Hub, which the update won't fix. Ideally, once you update, the reconnect behavior itself won't be the issue.

As far as the reconnection logic goes, it's somewhat spread out, but the majority is contained in this class/state machine. But, like I said, 2.12 should hopefully resolve the reconnect issue itself.
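One way to verify the new behavior after upgrading is to log every connection-state flip from the application side. This is a sketch assuming the on_connection_state_change handler added in v2.12, a no-argument handler signature, and the client.connected property; check the release notes for the exact names.

import logging
from azure.iot.device import IoTHubDeviceClient

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("connection-monitor")

client = IoTHubDeviceClient.create_from_connection_string("<device connection string>")

# Assumption: the handler is invoked with no arguments whenever the
# connection state changes, so a reconnect storm shows up as rapid flips.
def handle_connection_state_change():
    log.info("Connection state changed; connected=%s", client.connected)

client.on_connection_state_change = handle_connection_state_change
client.connect()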

gdebezpredel commented 5 months ago

Got it! OK, then updating to 2.12 should fix the main issue. Thank you for the info.