aws / amazon-freertos

DEPRECATED - See README.md
https://aws.amazon.com/freertos/
MIT License
2.54k stars 1.1k forks source link

IotMqtt_Disconnect sometimes not returning but leaving task blocked forever #3073

Closed lightblu closed 3 years ago

lightblu commented 3 years ago

Hi,

I just tried updating from the 202007.00 release to 202012.00.

We have integrated MQTT libraries to do MQTT on top of our own SOCKETS_ interface implementation. We are still using the MQTT V2.x.x APIs, which seems to be now the "MQTT Compatibility Layer" wrapper to the new coreMQTT library (as I understand it from https://github.com/aws/amazon-freertos/blob/master/libraries/c_sdk/standard/mqtt/README.md - it is quite taxing that this is now the third bigger API move within 2 years after already going from MQTT V1 to V2 - is there any stability on the horizon with the LTS right now? :/ )

Before with the 202007.00 all MQTT functionality was stable, however with the 202012.00 updae we sometimes (but not always) get the situation that IotMqtt_Disconnect does not return but just leaves the calling task hanging forever. Here is an example log, "cloud" is the task using the IotMqtt_x API functions:

2021-03-26 22:23:38 [TRACE] (   cloud): CLOSE MQTT id=1616797339  (our message, before publishing a last message)

2021-03-26 22:23:38 [TRACE] (   cloud): [INFO ][CMQTT][90021] (MQTT connection 0x200162d0) MQTT PUBLISH operation queued.
2021-03-26 22:23:38 [TRACE] (   cloud): [INFO ][CMQTT][90021] (MQTT connection 0x200162d0, PUBLISH operation 0x20016388) Waiting for operation completion.
2021-03-26 22:23:38 [TRACE] (   cloud): [INFO ][CMQTT][90462] (MQTT connection 0x200162d0, PUBLISH operation 0x20016388) Wait complete with result SUCCESS.
2021-03-26 22:23:38 [TRACE] (   cloud): calling IotMqtt_Disconnect() 
2021-03-26 22:23:38 [TRACE] (   cloud): [INFO ][CMQTT][90463] (MQTT connection 0x200162d0) Disconnecting connection.
2021-03-26 22:23:39 [TRACE] (   cloud): SOCKETS_Shutdown[0] 2
2021-03-26 22:23:39 [TRACE] (iot_thre): [ERROR][CMQTT][90860] (MQTT connection 0x200162d0) Failed to receive PINGRESP within 12000 ms.
2021-03-26 22:23:39 [TRACE] (iot_thre): SOCKETS_Shutdown[0] 2
2021-03-26 22:23:39 [TRACE] (   cloud): SOCKETS_Shutdown[0] 2 returned 0
2021-03-26 22:23:39 [TRACE] (   cloud): [INFO ][CMQTT][90862] (MQTT connection 0x200162d0) Network connection closed.
2021-03-26 22:23:39 [TRACE] (   cloud): got disconnected [SUCCESS]    <- this is the disconnect callback function
2021-03-26 22:23:39 [TRACE] (iot_thre): SOCKETS_Shutdown[0] 2 returned 0
2021-03-26 22:23:39 [TRACE] (iot_thre): [INFO ][CMQTT][90898] (MQTT connection 0x200162d0) Network connection closed.
2021-03-26 22:23:39 [TRACE] (iot_thre): got disconnected [INITIALIZATION FAILED]
2021-03-26 22:23:39 [TRACE] (iot_thre): SOCKETS_Close[0]
2021-03-26 22:23:39 [TRACE] (iot_thre): SOCKETS_Close returned [0]
2021-03-26 22:23:39 [TRACE] (iot_thre): [INFO ][CMQTT][90900] (MQTT connection 0x200162d0) Network connection destroyed.

This is what a good disconnect sequence looks like (above the IotMqtt_Disconnect() returned is missing, leaving the cloud task hanging in IotMqtt_Disconnect forever)

2021-03-26 22:35:46 [TRACE] (   cloud): CLOSE MQTT id=1616798064
2021-03-26 22:35:46 [TRACE] (   cloud): [INFO ][CMQTT][93794] (MQTT connection 0x200162d0) MQTT PUBLISH operation queued.
2021-03-26 22:35:46 [TRACE] (   cloud): [INFO ][CMQTT][93795] (MQTT connection 0x200162d0, PUBLISH operation 0x20016388) Waiting for operation completion.
2021-03-26 22:35:47 [TRACE] (   cloud): [INFO ][CMQTT][94291] (MQTT connection 0x200162d0, PUBLISH operation 0x20016388) Wait complete with result SUCCESS.

2021-03-26 22:35:47 [TRACE] (   cloud): calling IotMqtt_Disconnect()
2021-03-26 22:35:47 [TRACE] (   cloud): [INFO ][CMQTT][94292] (MQTT connection 0x200162d0) Disconnecting connection.
2021-03-26 22:35:47 [TRACE] (   cloud): SOCKETS_Shutdown[0] 2
2021-03-26 22:35:47 [TRACE] (   cloud): SOCKETS_Shutdown[0] 2 returned 0
2021-03-26 22:35:47 [TRACE] (   cloud): [INFO ][CMQTT][94724] (MQTT connection 0x200162d0) Network connection closed.
2021-03-26 22:35:47 [TRACE] (   cloud): got disconnected [SUCCESS]  <- this is the disconnect callback function
2021-03-26 22:35:47 [TRACE] (   cloud): SOCKETS_Close[0]
2021-03-26 22:35:47 [TRACE] (   cloud): SOCKETS_Close returned [0]
2021-03-26 22:35:47 [TRACE] (   cloud): [INFO ][CMQTT][94793] (MQTT connection 0x200162d0) Network connection destroyed.
2021-03-26 22:35:47 [TRACE] (   cloud): IotMqtt_Disconnect() returned

What stands out is that in the "bad" case somehow also the iot_thread starts actioning, surprisingly it tells about a failed PINGRESP (which is not true - over the modem communicatin I see the PING around 22:23:27 was successfully replied, and there were multiple successful publishes afterwards, the last you see there also in the log), and then similarly goes and calls SOCKETS_Shutdown and SOCKETS_CLOSE parallel to the calls from our cloud task.

Reading how the new coreMQTT is minimal and not thread safe and needs an agent around it (which for now can be the MQTT V2 now compatiblity wrapper, until the https://github.com/FreeRTOS/Lab-Project-coreMQTT-Agent is matured?), I wonder if it is us doing something wrong, or the MQTT V2 wrapping coreMQTT having a problem. Basically we see here what should not happen, two threads (cloud and iot_thread) suddenly working against coreMQTT?! It seems all our uses of the IotMqtt_x functions, especially the non-returning IotMqttDisconnect are fine, and also our underlying SOCKETS implementation behaves as it should.

Do you have any idea what could be going wrong, or what to try? Or would you recommend to not trying to migrate/upgrade now until https://github.com/FreeRTOS/Lab-Project-coreMQTT-Agent (and also https://github.com/aws/ota-for-aws-iot-embedded-sdk/tree/v3.0.0, which we would similarly need then to get rid of MQTT V2.x.x APIs?) are matured and completely integrated?

Thank you for any advise.

muneebahmed10 commented 3 years ago

Hi @lightblu,

Are you able to enable debug logging in the configuration file with IOT_LOG_DEBUG? Here's an example config file for Windows where logging can be set. That would help in pinpointing where the function is getting stuck.

Thank you

muneebahmed10 commented 3 years ago

Hi @lightblu,

I've merged #3082 which deals with a triggered assert in IotNetworkAfr_Destroy. Can you pull the latest master and test if the issue has been resolved?

lightblu commented 3 years ago

It seems https://github.com/aws/amazon-freertos/issues/2389 is still present for us, but I never figured out what we do wrong in this regard there :/ So I cannot go to IOT_LOG_DEBUG but just try with IOT_LOG_INFO right now, anyway will create a run with latest changes and IOT_LOG_INFO! Thank you.

lightblu commented 3 years ago

Bad news: that now reliably hardfaults for me on calling IotMqtt_Disconnect()

IotStaticMemory_ReturnInUse
$t
{
 0004BFC0   PUSH.W         {R4-R8, LR}
 0004BFC4   LDR            R7, [SP, #24]
 0004BFC6   MOV            R5, R1
 0004BFC8   MOV            R4, R2
( void ) memset( ptr, 0x00, elementSize );
 0004BFCA   MOVS           R1, #0
 0004BFCC   MOV            R2, R7
{
 0004BFCE   MOV            R6, R3
 0004BFD0   MOV            R8, R0
( void ) memset( ptr, 0x00, elementSize );
 0004BFD2   BL             memset                        ; 0x00067330
IotMutex_Lock( &( _mutex ) );
 0004BFD6   LDR            R0, =_mutex                   ; [0x0004C008] =0x2001AEE0
 0004BFD8   BL             IotMutex_Lock                 ; 0x000460D0

LR is pointing me here, PC within that memset, it is an IMPRECISEERR though (ARM M4).

But actually, looking and hacking a bit around, and just adding that to IotStaticMemory_ReturnInUse

void IotStaticMemory_ReturnInUse( void * ptr,
                                  void * pPool,
                                  bool * pInUse,
                                  size_t limit,
                                  size_t elementSize )
{

    if(ptr==NULL){
        IotLogWarn( "riu: %p %i", ptr, elementSize );
        vTaskDelay(300);
        return;
    }

I can remove the hardfault, and on the console I see quite some attempted NULL pointer "frees":

...
2021-04-02 19:58:36 [TRACE] (   cloud): IotMqtt_Disconnect() called
2021-04-02 19:58:36 [TRACE] (   cloud): [INFO ][CMQTT][39334] (MQTT connection 0x20019250) Disconnecting connection.
2021-04-02 19:58:37 [TRACE] (   cloud): SOCKETS_Shutdown[0] 2
2021-04-02 19:58:37 [TRACE] (   cloud): SOCKETS_Shutdown[0] 2 returned 0
2021-04-02 19:58:37 [TRACE] (   cloud): [INFO ][CMQTT][40335] (MQTT connection 0x20019250) Network connection closed.
2021-04-02 19:58:37 [TRACE] (   cloud): got disconnected [SUCCESS]
2021-04-02 19:58:37 [TRACE] (   cloud): [WARN ][S][40336] riu: 0x0 2048
2021-04-02 19:58:38 [TRACE] (   cloud): [WARN ][S][40636] riu: 0x0 2048
2021-04-02 19:58:38 [TRACE] (   cloud): [WARN ][S][40936] riu: 0x0 2048
2021-04-02 19:58:38 [TRACE] (   cloud): [WARN ][S][41236] riu: 0x0 2048
2021-04-02 19:58:38 [TRACE] (   cloud): [WARN ][S][41536] riu: 0x0 2048
2021-04-02 19:58:39 [TRACE] (   cloud): [WARN ][S][41836] riu: 0x0 2048
2021-04-02 19:58:39 [TRACE] (   cloud): [WARN ][S][42136] riu: 0x0 2048
2021-04-02 19:58:39 [TRACE] (   cloud): [WARN ][S][42436] riu: 0x0 2048
2021-04-02 19:58:40 [TRACE] (   cloud): [WARN ][S][42736] riu: 0x0 2048
2021-04-02 19:58:40 [TRACE] (   cloud): SOCKETS_Close[0]
2021-04-02 19:58:40 [TRACE] (   cloud): SOCKETS_Close returned [0]
2021-04-02 19:58:40 [TRACE] (   cloud): [INFO ][CMQTT][43037] (MQTT connection 0x20019250) Network connection destroyed.
2021-04-02 19:58:40 [TRACE] (   cloud): IotMqtt_Disconnect() returned
...

(likely obvious now: we are running with IOT_STATIC_MEMORY_ONLY )

lightblu commented 3 years ago

For now I couldn't reproduce the described behavior anymore (with latest version as of now, including #3082 and #3100), thus closing.