Azure / azure-iot-sdk-python

A Python SDK for connecting devices to Microsoft Azure IoT services
MIT License
422 stars 375 forks source link

Device stuck in reconnect loop while renewing SAS token #1181

Closed dsburns closed 3 months ago

dsburns commented 4 months ago

Context

Description of the issue

The iot hub client appears to be stuck in a connect/disconnect loop when reauthorizing it's SAS token. The issue occurs randomly across devices and can take anywhere from 12hours to several days before first occurence. After several minutes of connecting and disconnecting, my application is restarted and the azure client connects as expected.

Summary of events

  1. 00:16:10 - Azure message sent
  2. 00:16:12 - Azure blob uploaded
  3. 00:16:14 - Reauth process for SAS token starts
  4. 00:16:19 - Client disconnects
  5. 00:16:24 - ConnectionFailedError raised in mqtt_transport.py
  6. 00:16:24 - Retrying connection authorization
  7. 00:16:24 - Successfully connects to hub
  8. 00:16:34 - Starting reauthorization process for new SAS token
  9. 00:16:35 - Disconnected with result code 7

Steps 4-9 continue until the application is stopped.

Code sample exhibiting the issue

class Client:

    def __init__(self, connection_str):
        self._connection_str = connection_str
        self._client: IoTHubDeviceClient = None

    def _statistics_increment(self, name):
        self.statistics[name] += 1
        return self.statistics[name]

    async def open(self):
        log.debug('Azure open start')
        if self._client is not None:
            await self.close()
        self._client = IoTHubDeviceClient.create_from_connection_string(self._connection_str)
        await self._client.connect()
        self._twin = await self._client.get_twin()
        log.debug('TWIN: %s', self._twin)
        # set the message handler on the client
        self._client.on_message_received = self._message_handler
        self._client.on_twin_desired_properties_patch_received = self._twin_patch_listener
        self._client.on_method_request_received = self._device_method_listener
        log.debug('Azure open completed')

    async def close(self):
        if self._client is None:
            return
        log.info('Azure close')
        await self._client.disconnect()

    def is_connected(self):
        connected = False
        if self._client:
            connected = self._client.connected
        return connected

    async def message_send(self, payload):
        await self._client.send_message(payload)

def run()
    while not self._exit_event.is_set():
            try:
                log.debug(f"azure service loop - {azure} connected: {azure.is_connected() if azure else False}")

                if not azure and self._azure_enabled:
                    azure = self._azure_factory()

                # Determine if we need to connect to the IoT Hub.  We only try to reconnect on an interval to prevent spamming the
                # IoT Hub with connection attemps.
                poll_time_now = time.time()
                attempt_reconect = not azure.is_connected() and (poll_time_now >= (last_connect_poll_time + AZURE_OPEN_INTERVAL_SEC))
                if attempt_reconect:
                    last_connect_poll_time = poll_time_now
                    log.debug("azure.open() start")
                    await asyncio.wait_for(azure.open(), timeout=10.0)
                    self._set_and_send_azure_connection_status(True)
                    log.debug('azure.open() complete')
                    last_disconnect_ts = 0

                # Send messages - Coroutines are setup to swallow exceptions.  Reconnect logic is handled by the azure.is_connected
                # function which is much more relaible and simpler than trying to propagate connectivity errors back to the main loop
                # from coroutines.
                if azure.is_connected():
                    for handler in self._d2c_handlers:
                        while not handler.queue.empty():
                            _, payload, properties = handler.queue.get(False)
                            task = asyncio.create_task(handler.upload(azure, payload, properties))
                            running_tasks.add(task)
                            task.add_done_callback(running_tasks.discard)
            except (
                azure_exceptions.ClientError,
                azure_exceptions.ConnectionDroppedError,
                azure_exceptions.ConnectionFailedError,
                azure_exceptions.CredentialError,
                azure_exceptions.NoConnectionError,
                azure_exceptions.OperationCancelled,
                azure_exceptions.OperationTimeout,
                azure_exceptions.ServiceError,
            ) as ex:
                log.exception('azure service exception')
                self._set_and_send_azure_connection_status(False)
                disconnect_ts = time.time()
                if azure:
                    await asyncio.wait_for(azure.close(), timeout=10.0)
                    azure = None

            await asyncio.sleep(1.0)

Console log of the issue

See attached logs. The debug.log file contains DEBUG log statements for the 'azure.iot.device' logger. azure_error_summary.log azure_error_debug.log

dsburns commented 4 months ago

Bump

cartertinney commented 4 months ago

@dsburns

Can you tell me a bit more about your use case?

1) Are you using IoTHub or IoT Edge? 2) What type of authentication are you using (Connection String?) 3) Are you specifying a sastoken_ttl value when you create your client? 4) If using IoTHub, are you using a Gateway v1 Hub, or a Gateway v2 Hub?

cartertinney commented 3 months ago

Closed due to inactivty