configureOfflinePublishQueueing(0) setting in AWSIoTMQTTClient object not working.

==============================================================================

Python 2.7.16
Python 3.7.3
AWSIoTPythonSDK 1.4.9

Raspberry Pi reference 2020-05-27
Raspberry Pi 4 Model B Rev 1.1

Distributor ID: Raspbian
Description: Raspbian GNU/Linux 10 (buster)
Release: 10
Codename: buster

==============================================================================

AWS Support, with whom we have a Business support plan, asked me to open this as an issue. The behavior described below is essentially the same as what I have seen in a couple of other links I have found here, but to follow their instructions I am creating this issue with my specifics.

==============================================================================

For a little more background: the way we use IoT MQTT is a little different, in that our application's use of MQTT is driven by real-time human interaction, where an immediate action (publish) is followed by an immediate reaction (subscribe). For example, we have a Raspberry Pi with a scanner and display connected, and the operator scans a LOGIN barcode. That (action) message is immediately published to our back-end system. Our back-end system then publishes a message back to the Pi stating “SCAN USER ID”. The Raspberry Pi, which is subscribed, immediately receives that message (reaction) and shows it on the display. This is one real-life example, but this type of action/reaction is how our application uses MQTT. So when there are communication interruptions, it is crucial that we avoid “duplicate” messages if the interruption happens to occur while our application is trying to publish: we need at least one copy to get through after reconnection, but no more than one.

With that in mind, and after more testing with different combinations of settings to best suit how we use MQTT, it appears that configureOfflinePublishQueueing(0) may work best for us. Because we retry .publish() until it succeeds, this setting appears to get at least one message through. There can still be duplicates, but only a few.

Here are the connection settings we are currently using for this approach:

from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTClient

myAWSIoTMQTTClient = None

myAWSIoTMQTTClient = AWSIoTMQTTClient(iot_clientid)
myAWSIoTMQTTClient.configureEndpoint(iot_host, iot_port)
myAWSIoTMQTTClient.configureCredentials(iot_rootcapath, iot_privatekeypath, iot_certificatepath)

myAWSIoTMQTTClient.configureAutoReconnectBackoffTime(1, 32, 20)
myAWSIoTMQTTClient.configureOfflinePublishQueueing(0)
myAWSIoTMQTTClient.configureDrainingFrequency(2)
myAWSIoTMQTTClient.configureConnectDisconnectTimeout(10)
myAWSIoTMQTTClient.configureMQTTOperationTimeout(10)

Here is the code snippet around the .publish() call:

message_to_publish = '{"message": "3BF16D01-D0EA-450F-A297-29543CE11640~ttyUSB0~PASTREDTM~0000000CCFC4LOGIN\r", "messagetype": "FAASPUBLISHJOB", "urlsetid": 500} '

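import time
from AWSIoTPythonSDK.exception import AWSIoTExceptions

# Retry the same message once per second until publish() returns.
while True:
    try:
        myAWSIoTMQTTClient.publish(pub_topic, message_to_publish, 1)  # QoS 1
        break  # publish() returned without raising, so stop retrying
    except AWSIoTExceptions.publishQueueFullException as iotqfe:
        time.sleep(1)
        continue
    except AWSIoTExceptions.publishQueueDisabledException as iotqde:
        time.sleep(1)
        continue
    except AWSIoTExceptions.publishTimeoutException as iottoe:
        time.sleep(1)
        continue
    except Exception as e:
        time.sleep(1)
        continue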
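To make the exception pattern described in this report easier to compare between runs, the retry loop can also tally which exception each attempt raised. The following is only a minimal sketch: publish_with_tally and its parameters are hypothetical names, not part of the SDK, and it assumes the publish callable takes (topic, payload, qos) the way myAWSIoTMQTTClient.publish does.

```python
import time
from collections import Counter


def publish_with_tally(publish, topic, payload, qos=1,
                       retry_delay=1.0, max_attempts=None):
    """Retry publish() until it returns, counting exceptions by class name.

    `publish` is any callable taking (topic, payload, qos).
    Returns (succeeded, attempts, tally).
    """
    tally = Counter()
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            publish(topic, payload, qos)
            return True, attempts, tally
        except Exception as exc:
            # Record the exception type so runs can be compared afterwards.
            tally[type(exc).__name__] += 1
            time.sleep(retry_delay)
    # Gave up after max_attempts failed attempts.
    return False, attempts, tally
```

Called as publish_with_tally(myAWSIoTMQTTClient.publish, pub_topic, message_to_publish), the returned tally shows how many attempts failed with each exception type before the publish went through.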

When using this method, this is how everything unfolds.

When Wifi is disconnected, the first .publish() attempt throws a publishTimeoutException after 20 seconds, based on our configureMQTTOperationTimeout(20) setting.
While Wifi is still disconnected, all subsequent .publish() attempts immediately throw a publishQueueDisabledExceptionOffline exception.
When Wifi is reconnected, the .publish() succeeds, but more than one copy of the message we were trying to publish still appears to be sent. With configureOfflinePublishQueueing(0), it appears to send only one additional message per publishTimeoutException, so since there is only one of those exceptions, I get only one duplicate, versus many when we had the setting as configureOfflinePublishQueueing(-1). This does not fully guarantee our requirement of at least one message through with no duplicates, but we can probably live with one duplicate versus many and still get the benefit of our application recovering gracefully from communication disruptions on its own. However, these are the issues even with this approach.
The behavior explained above is not consistent. Sometimes there is 1 publishTimeoutException and sometimes up to 3 before it goes into the publishQueueDisabledExceptionOffline exceptions. When this happens, we could potentially get up to 3 (or more) duplicate published messages.
This configureOfflinePublishQueueing() setting doesn’t appear to always take effect. I can run my sandbox application with this set to (0) and everything works as explained above. Then sometimes, running the exact same sandbox application with the exact same setting, it is almost as if MQTT thinks the setting is configureOfflinePublishQueueing(-1), because only the publishTimeoutException occurs on each .publish() attempt and never the publishQueueDisabledExceptionOffline exception as expected.
I use the MQTT logger (logger = logging.getLogger("AWSIoTPythonSDK.core")) to help see what’s going on, and I can see MQTT showing the (0) setting. But again, sometimes the behavior is 1 publishTimeoutException followed by many publishQueueDisabledExceptionOffline, and sometimes it’s just many publishTimeoutException, and however many of those occur results in that many duplicate messages when reconnected.

2021-08-03 11:06:10,422 - AWSIoTPythonSDK.core.protocol.internal.clients - DEBUG - Initializing MQTT layer...
2021-08-03 11:06:10,424 - AWSIoTPythonSDK.core.protocol.internal.clients - DEBUG - Registering internal event callbacks to MQTT layer...
2021-08-03 11:06:10,425 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - MqttCore initialized
2021-08-03 11:06:10,425 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Client id: 3BF16D01-D0EA-450F-A297-29543CE11640
2021-08-03 11:06:10,426 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Protocol version: MQTTv3.1.1
2021-08-03 11:06:10,427 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Authentication type: TLSv1.2 certificate based Mutual Auth.
2021-08-03 11:06:10,427 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring endpoint...
2021-08-03 11:06:10,428 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring certificates...
2021-08-03 11:06:10,429 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring reconnect back off timing...
2021-08-03 11:06:10,430 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Base quiet time: 1.000000 sec
2021-08-03 11:06:10,430 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Max quiet time: 32.000000 sec
2021-08-03 11:06:10,431 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Stable connection time: 20.000000 sec
2021-08-03 11:06:10,432 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring offline requests queue draining interval: 0.500000 sec
2021-08-03 11:06:10,433 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring connect/disconnect time out: 10.000000 sec
2021-08-03 11:06:10,433 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring MQTT operation time out: 20.000000 sec
2021-08-03 11:06:10,434 - AWSIoTPythonSDK.core.protocol.mqtt_core - INFO - Configuring offline requests queueing: max queue size: 0
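For anyone trying to reproduce this, log output in the format above can be obtained with the standard logging module. This is a minimal sketch: enable_sdk_debug_logging is a hypothetical helper name, and the format string is an assumption chosen to match the log lines quoted in this report.

```python
import logging
import sys


def enable_sdk_debug_logging(stream=None):
    """Attach a DEBUG-level stream handler to the SDK's core logger."""
    logger = logging.getLogger("AWSIoTPythonSDK.core")
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    # timestamp - logger name - level - message, as in the output above
    handler.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    return logger
```

Calling enable_sdk_debug_logging() once, before the client object is created, should be enough for the configuration messages to show up, because the SDK's module loggers sit underneath "AWSIoTPythonSDK.core" and propagate to its handler.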

So without some consistency in how the setting behaves, I do not feel comfortable relying on this approach to ensure that 1) at least one copy of the message is published successfully and 2) that message is not published more than once.

QUESTION: Is there some reason I am not aware of that would cause this inconsistent behavior with this setting? I have run my sandbox application both on Windows and directly on our Raspberry Pi, where our normal application runs, and have seen this happen on both platforms.

When the configureOfflinePublishQueueing(0) setting is recognized and the expected results occur, it works well enough. Therefore, I applied the configureOfflinePublishQueueing(0) setting to our real application (with MQTT logging turned on). Sure enough, even though the MQTT logger shows Configuring offline requests queueing: max queue size: 0, when I disconnect Wifi the .publish() just throws constant publishTimeoutException exceptions instead of the expected 1 publishTimeoutException followed by many publishQueueDisabledExceptionOffline. As a result, the publish attempts made offline during the loop caused the same message to be published many times after reconnection. I then stopped my real application, ran my sandbox application right then and there on the same Pi, and it worked as expected. I tried the real application again right away, and it did not work as expected, even though both applications have configureOfflinePublishQueueing(0). This is also part of the inconsistency I was referring to.

There was also a suggestion in a link here to empty the publish queue myself in our application when the .publish() fails, so I tried emptying the [._mqtt_core._internal_async_client._paho_client._out_messages queue] as well. The strange thing is that, just like with the configureOfflinePublishQueueing(0) setting, when I ran my simple sandbox program right from Visual Studio Code, this solution actually worked, and stepping through the code in the debugger I could see that I was emptying the queue. However, when I add the exact same code to our real application, it does not work, and I still get the same constant publishTimeoutException exceptions.

The one big difference between the sandbox and the real application is that in the real application, all of the MQTT business logic runs in its own thread. I could be wrong, but it looks like there is some thread-safety logic in the SDK code that restricts access to objects such as this publish queue from another thread, which could be why this approach doesn't work in the real application.
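To illustrate the threading setup described above, here is a rough sketch (Python 3) of one way to keep every publish call on the single thread that owns the client, with other threads handing requests over through a queue. PublishWorker and its methods are hypothetical names, and this has not been verified to change the behavior reported here; it only shows the arrangement.

```python
import queue
import threading


class PublishWorker(threading.Thread):
    """Runs every publish() call on one dedicated thread."""

    def __init__(self, publish):
        super().__init__(daemon=True)
        self._publish = publish          # e.g. myAWSIoTMQTTClient.publish
        self._requests = queue.Queue()

    def submit(self, topic, payload, qos=1):
        """Queue a publish request; returns (done_event, result_dict)."""
        done = threading.Event()
        result = {}
        self._requests.put((topic, payload, qos, done, result))
        return done, result

    def stop(self):
        self._requests.put(None)         # sentinel that ends run()

    def run(self):
        while True:
            item = self._requests.get()
            if item is None:
                break
            topic, payload, qos, done, result = item
            try:
                result["ok"] = self._publish(topic, payload, qos)
            except Exception as exc:
                result["error"] = exc    # hand the exception back to the caller
            finally:
                done.set()
```

A caller on another thread would use done, result = worker.submit(pub_topic, message_to_publish), wait on done, and then check result for either "ok" or "error".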

I have also seen the suggestion to try the Python v2 SDK. However, from reading its open issues, the same offline-publish behavior appears to occur in that version as well: all messages are queued, and unlike v1 there is no setting to control it. For us, that would leave basically the same problem we currently have in v1.

In closing, since our application is an action/reaction type of message exchange, it is easy to see when a message was published more than once unexpectedly, because our back-end system responds to each one: if 2 “LOGIN” messages were published, 2 “SCAN USER ID” messages would be received. We would also get charged for the “duplicate” messages, and we want to avoid that as well.
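One way to count duplicates precisely while testing, rather than inferring them from the number of responses, would be to tag each logical message with a unique ID before the retry loop and count repeats on the receiving side. This is a sketch only: the "requestid" field and both function names are hypothetical and are not part of our current message format.

```python
import json
import uuid
from collections import Counter


def tag_message(payload_dict):
    """Return the payload as JSON with a unique, hypothetical "requestid" field."""
    tagged = dict(payload_dict)
    tagged["requestid"] = str(uuid.uuid4())
    return json.dumps(tagged)


def count_duplicates(received_payloads):
    """Map each requestid seen more than once to its number of extra copies."""
    seen = Counter(json.loads(p)["requestid"] for p in received_payloads)
    return {rid: n - 1 for rid, n in seen.items() if n > 1}
```

The message would be tagged once, outside the retry loop, so that every retry of the same scan carries the same ID and any copy delivered twice shows up in count_duplicates.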

Thanks for all your help!

Kirk

aws / aws-iot-device-sdk-python

configureOfflinePublishQueueing(0) setting in AWSIoTMQTTClient object not working. #293
