Azure / azure-iot-sdk-c

A C99 SDK for connecting devices to Microsoft Azure IoT services
https://azure.github.io/azure-iot-sdk-c
Other
587 stars 740 forks source link

Fix AMQP transport to break infinite loop of link re-ATTACH-ments #2472

Closed ewertons closed 1 year ago

ewertons commented 1 year ago
# Checklist - [x] I have read the [contribution guidelines] (https://github.com/Azure/azure-iot-sdk-c/blob/main/.github/CONTRIBUTING.md). - [x] I added or modified the existing tests to cover the change (we do not allow our test coverage to go down). - If this is a modification that impacts the behavior of a public API - [ ] I edited the corresponding document in the `devdoc` folder and added or modified requirements. - I submitted this PR against the correct branch: - [x] This pull-request is submitted against the `main` branch. - [x] I have merged the latest `main` branch prior to submission and re-merged as needed after I took any feedback. - [x] I have squashed my changes into one with a clear description of the change. # Reference/Link to the issue solved with this PR (if any) # Description of the problem
In iothubtransport_amqp_common.c `number_of_previous_failures` is used to track issues with the AMQP device and either restart or re-connect it as appropriately.
Note: AMQP device (iothubtransport_amqp_device.c) is a subcomponent of the AMQP transport, representing each registered device.
The AMQP transport can either just restart a registered AMQP device (causing just re-ATTACH-ment of AMQP links) or completely reconnect it (with END, CLOSE, OPEN, BEGIN, ...).
When a failure occurs (e.g., the Twin messenger reports an error), the AMQP device goes into an error state itself and `number_of_previous_failures` (on AMQP transport layer) is incremented.
If `number_of_previous_failures` is less than `MAX_NUMBER_OF_DEVICE_FAILURES`, the AMQP device is just restarted.
If `number_of_previous_failures` reaches `MAX_NUMBER_OF_DEVICE_FAILURES`, the AMQP device is completely reconnected.
If `number_of_previous_failures` never reaches `MAX_NUMBER_OF_DEVICE_FAILURES` (iothubtransport_amqp_common.c:1058) then the transport will never force a full reconnection, always trying to stop the registered AMQP device.
When IoTHubTransport_AMQP_Common_Device_DoWork is called (for each registered device), if `send_pending_events` succeeds, `number_of_previous_failures` is set to 0.
But all `send_pending_events` does is to add a message on the outgoing queue of the AMQP messenger, so it is always expected to succeed, always resetting `number_of_previous_failures`.
That prevents in such case the AMQP transport from fully reconnecting with the IoT Hub in specific error scenarios - like when the Twin throttling occurs, causing outgoing Twin messages (e.g, PATCH, i.e., reported properties update) to fail.
The AMQP transport keeps trapped in a loop of just trying to restablish the AMQP links in in this case.

Description of the solution

Move the code that resets the number of failures to the correct places, when messages get effectively and successfully sent or received.