FreeRTOS / iot-reference-esp32

MIT License
81 stars 49 forks source link

[BUG] coreMQTT doesn't reconnect and keep alive handling fails #91

Closed NightSkySK closed 3 months ago

NightSkySK commented 4 months ago

Describe the bug

At first glance, the subject can suggest that it is the same issue as described in https://github.com/FreeRTOS/iot-reference-esp32c3/issues/47 and https://github.com/FreeRTOS/iot-reference-esp32c3/issues/48 However, as the symptoms are the same the solution from those issues haven't solve my problem. Let me explain step by step what happened and my observations and logs (It will be quite a long thread, I'm sorry...)

The story begins when I found that my code is suffering from mbedtls memory leak: In logs I found esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00 which was clear indication mbedtls memory leak

2024-07-22 09:01:59 
I (38176357) coreMQTT: Publishing message to /gates/GA97DF0004.

2024-07-22 09:02:00 
E (38177577) coreMQTT: sendMessageVector: Unable to send packet: Network Error.
E (38177577) coreMQTT: MQTT PUBLISH failed with status MQTTSendFailed.
E (38177577) coreMQTT: MQTT operation failed with status MQTTSendFailed

W (38177577) gate_control: Error or timed out waiting for ack for publish message 12420. Re-attempting publish.
I (38177587) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (38177587) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1664

I (38177607) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38177597) core_mqtt_agent_manager: TLS connection was disconnected.
I (38177627) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38177597) gate_control: Task "GateReport" sending publish request to coreMQTT-Agent with message "{"openReceived": 0,"gateSensors":{ "IN1": 0, "IN2": 0},"keyboard":{ "new_key": 0, "key": ""}, "iteration": 4135}" on topic "/gates/GA97DF0004" with ID 12420.
I (38177647) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (38177667) gate_control: Task "GateReport" waiting for publish 12420 to complete.
I (38177677) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-22 09:02:00 
E (38177947) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38177947) esp-tls: create_ssl_handle failed
E (38177947) esp-tls: Failed to open new connection

2024-07-22 09:02:01 
I (38178267) core_mqtt_agent_manager: Retry attempt 1.

2024-07-22 09:02:01 
E (38178487) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38178487) esp-tls: create_ssl_handle failed
E (38178487) esp-tls: Failed to open new connection

2024-07-22 09:02:01 
W (38178577) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (38178577) AWS_OTA: OTA Agent is suspended.
I (38178577) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]

2024-07-22 09:02:01 
I (38178947) core_mqtt_agent_manager: Retry attempt 2.

2024-07-22 09:02:01 
E (38179077) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38179077) esp-tls: create_ssl_handle failed
E (38179077) esp-tls: Failed to open new connection

2024-07-22 09:02:03 I (38180557) core_mqtt_agent_manager: Retry attempt 3.
E (38180707) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38180717) esp-tls: create_ssl_handle failed
E (38180717) esp-tls: Failed to open new connection

In the same software version I noticed that MQTTKeepAliveTimeout appear from time to time but the software can handle reconnection without issue:

2024-07-21 16:33:28 
I (18501007) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 6002

2024-07-21 16:33:36 
E (18508867) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (18508867) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (18508867) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout
I (18508877) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (18508887) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (18508877) core_mqtt_agent_manager: TLS connection was disconnected.
I (18508907) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (18508927) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (18508927) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-21 16:33:38 
I (18511227) coreMQTT: MQTT connection established with the broker.
I (18511227) core_mqtt_agent_manager: Session present: 0

I (18511227) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005/update will be attempted.
W (18511227) gate_control: Error or timed out waiting for ack to subscribe message 6002. Re-attempting subscribe.
I (18511237) report: coreMQTT-Agent connected.
I (18511257) gate_control: coreMQTT-Agent connected.
I (18511257) supervisor: coreMQTT-Agent connected.
I (18511257) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 6002
I (18511267) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (18511287) core_mqtt_agent_manager: coreMQTT-Agent connected.

So to fix mbedtls memory issue I've applied following commits: https://github.com/FreeRTOS/iot-reference-esp32c3/commit/a0fe220e48d274e4b724d50109f45552fa98a58d referring to sdkconfig.default and https://github.com/FreeRTOS/iot-reference-esp32c3/commit/e2d407f72c20202730e7ac641506559fc36eee43 referring to main/networking/mqtt/core_mqtt_agent_manager.c

And it worked well for mbedtls memory issue, I can't find any evidence esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00 anymore, however the frequency of MQTTKeepAliveTimeout significantly increased and software is no longer capable to recover MQTT connection.

2024-07-24 01:30:43 
I (7749746) coreMQTT: Publishing message to /gates/GA97DF0004.

2024-07-24 01:30:46 
I (7753386) ota_over_mqtt: OTA agent resumed by timer.

2024-07-24 01:30:48 
I (7754386) AWS_OTA: otaPal_GetPlatformImageState
I (7754386) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (7754386) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (7754386) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]

2024-07-24 01:30:49 
I (7755416) gate_control: Task "GateControl" iteration 128 completed.

2024-07-24 01:30:49 
E (7755746) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (7755746) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (7755746) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (7755756) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (7755766) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (7755786) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (7755796) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (7755806) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-24 01:30:50 W (7756396) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
2024-07-24 01:32:11 E (7837816) coreMQTT: No command structure available.
2024-07-24 01:32:21 E (7847866) coreMQTT: No command structure available.
2024-07-24 01:32:31 E (7857916) coreMQTT: No command structure available.
2024-07-24 01:32:41 E (7867966) coreMQTT: No command structure available.

And on this stage we have two issues:

  1. Why MQTTKeepAliveTimeout even if few seconds earlier there was successfully published msg to AWS IoT MQTT server and from AWS IoT Documantation page I can read:

For MQTT (or MQTT over WebSockets) connections, a client can request a keep-alive interval between 30 - 1200 seconds as part of the MQTT CONNECT message. AWS IoT starts the keep-alive timer for a client when sending CONNACK in response to the CONNECT message. This timer is reset whenever AWS IoT receives a PUBLISH, SUBSCRIBE, PING, or PUBACK message from the client. AWS IoT will disconnect a client whose keep-alive timer has reached 1.5x the specified keep-alive interval (i.e., by a factor of 1.5).The default keep-alive interval is 1200 seconds. If a client requests a keep-alive interval of zero, the default keep-alive interval will be used. If a client requests a keep-alive interval greater than 1200 seconds, the default keep-alive interval will be used. If a client requests a keep-alive interval shorter than 30 seconds but greater than zero, the server treats the client as though it requested a keep-alive interval of 30 seconds.

where key information is This timer is reset whenever AWS IoT receives a PUBLISH, SUBSCRIBE, PING, or PUBACK message from the client.

I've also tried to increase in sdkconfig.default from CONFIG_GRI_MQTT_AGENT_KEEP_ALIVE_INTERVAL_SECONDS=60 to CONFIG_GRI_MQTT_AGENT_KEEP_ALIVE_INTERVAL_SECONDS=600 without any major difference

which I found confirmation at the server side as Livecycle Connect/Disconnect events don't show MQTT_KEEP_ALIVE_TIMEOUT but DUPLICATE_CLIENTID caused by rebooting ESP32 without proper disconnection with MQTT server.

  1. The application lost the capability to recover from coreMQTT-Agent disconnected my short-term solution is to trigger the device reboot by my supervisor task, it isn't an elegant solution and I would prefer to fix it properly.

System information

NightSkySK commented 4 months ago

I've just revert changes introduced by commit https://github.com/FreeRTOS/iot-reference-esp32c3/commit/e2d407f72c20202730e7ac641506559fc36eee43

and as long as it doesn't solve MQTTKeepAliveTimeout software is again capable of recovering MQTT connection without necessary rebooting the device

2024-07-24 12:08:17 I (343095) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 110

2024-07-24 12:08:24 E (350095) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (350095) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (350095) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (350105) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (350105) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1896

I (350115) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (350105) core_mqtt_agent_manager: TLS connection was disconnected.
I (350135) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (350145) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (350155) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-24 12:08:27 I (352415) coreMQTT: MQTT connection established with the broker.
I (352415) core_mqtt_agent_manager: Session present: 0

I (352415) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005/update will be attempted.
W (352415) gate_control: Error or timed out waiting for ack to subscribe message 110. Re-attempting subscribe.
I (352435) report: coreMQTT-Agent connected.
I (352445) gate_control: coreMQTT-Agent connected.
I (352455) supervisor: coreMQTT-Agent connected.
I (352455) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 110
I (352455) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (352475) core_mqtt_agent_manager: coreMQTT-Agent connected.

2024-07-24 12:08:29 I (355055) gate_control: Subscribe 110 for topic filter /gates/GA97DF0005 succeeded for task "GateReport".

Unfortunately without those changes also E (38628625) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00 returns to be an issue:

I (38625455) gate_control: Task "GateReport" sending unsubscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0004 with id 12502

2024-07-24 22:46:00 E (38627905) coreMQTT: sendMessageVector: Unable to send packet: Network Error.
W (38627905) gate_control: Error or timed out waiting for ack to unsubscribe message 12502. Re-attempting subscribe.
E (38627905) coreMQTT: MQTT operation failed with status MQTTSendFailed

I (38627925) gate_control: Task "GateReport" sending unsubscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0004 with id 12502
I (38627925) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (38627925) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1708

I (38627945) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38627925) core_mqtt_agent_manager: TLS connection was disconnected.
I (38627965) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38627985) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (38627995) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-24 22:46:02 E (38628625) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38628635) esp-tls: create_ssl_handle failed
E (38628635) esp-tls: Failed to open new connection
W (38628695) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (38628695) AWS_OTA: OTA Agent is suspended.
I (38628695) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
I (38628945) core_mqtt_agent_manager: Retry attempt 1.

2024-07-24 22:46:02 E (38629155) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38629155) esp-tls: create_ssl_handle failed
E (38629155) esp-tls: Failed to open new connection

2024-07-24 22:46:02 I (38629615) core_mqtt_agent_manager: Retry attempt 2.

2024-07-24 22:46:03 E (38629715) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38629715) esp-tls: create_ssl_handle failed
E (38629715) esp-tls: Failed to open new connection

2024-07-24 22:46:04 I (38631185) core_mqtt_agent_manager: Retry attempt 3.
E (38631405) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38631405) esp-tls: create_ssl_handle failed
E (38631405) esp-tls: Failed to open new connection

2024-07-24 22:46:06 I (38633155) core_mqtt_agent_manager: Retry attempt 4.
ActoryOu commented 4 months ago

Hi @NightSkySK, Thanks for your feedback!

1. Why MQTTKeepAliveTimeout even if few seconds earlier there was successfully published msg to
   AWS IoT MQTT server and from AWS IoT Documantation page I can read:

My personal guess is that the ping request is sent from device side and it's waiting for ping response. In this case, even though we've sent the publish message successfully, this scenario will also happen (refer to this line). Let me trigger an internal discussion to see if we can have better solution!

2. The application lost the capability to recover from coreMQTT-Agent disconnected
   my short-term solution is to trigger the device reboot by my supervisor task, it isn't
   an elegant solution and I would prefer to fix it properly.

I'm looking into this, will back later once I found something.

Thank you!

NightSkySK commented 4 months ago

Hi @ActoryOu thank you for looking into this issue.

My personal guess is that the ping request is sent from device side and it's waiting for ping response. In this case, even though we've sent the publish message successfully, this scenario will also happen (refer to this line). Let me trigger an internal discussion to see if we can have better solution!

Maybe the MQTT_PINGRESP_TIMEOUT_MS is too short for weaker network conditions? To investigate this issue I'm running simultaneously 6 devices, of which 3 are connected to IoT WiFI network where the router is supplied with 300Mbit/s fibre internet and there I don't see that MQTTKeepAliveTimeout was so frequently or even at all, much worse situation is with three other devices which are connected to TP-LINK Archer MR200 4G LTE where internet is taken from cellular LTE network. I haven't compared the pings in those networks, but I can imagine that LTE one has higher pings than the fibre one.

ActoryOu commented 4 months ago
2. The application lost the capability to recover from coreMQTT-Agent disconnected
   my short-term solution is to trigger the device reboot by my supervisor task, it isn't
   an elegant solution and I would prefer to fix it properly.

Looks like the CORE_MQTT_AGENT_DISCONNECTED_BIT in event group is cleared somehow, which is not expected. Would you mind adding some logs to track that? Please help provide device log in error period. Here are some places to add logs.

ActoryOu commented 4 months ago

Maybe the MQTT_PINGRESP_TIMEOUT_MS is too short for weaker network conditions? To investigate this issue I'm running simultaneously 6 devices, of which 3 are connected to IoT WiFI network where the router is supplied with 300Mbit/s fibre internet and there I don't see that MQTTKeepAliveTimeout was so frequently or even at all, much worse situation is with three other devices which are connected to TP-LINK Archer MR200 4G LTE where internet is taken from cellular LTE network. I haven't compared the pings in those networks, but I can imagine that LTE one has higher pings than the fibre one.

This is definitely possible. You can try enlarge it as workaround.

NightSkySK commented 4 months ago

@ActoryOu

I've implemented all logging requested:

2. The application lost the capability to recover from coreMQTT-Agent disconnected
   my short-term solution is to trigger the device reboot by my supervisor task, it isn't
   an elegant solution and I would prefer to fix it properly.

Looks like the CORE_MQTT_AGENT_DISCONNECTED_BIT in event group is cleared somehow, which is not expected. Would you mind adding some logs to track that? Please help provide device log in error period. Here are some places to add logs.

As in normal condition each while loop iteration on L777 it was printing: W (167796) TAG: L777 xEventGroupWaitBits: 5

I've made such condition in code:

        if( eMqttRet == MQTTSuccess )
        {
            while( 1 )
            {
                xReturnedEventBits_2 = xEventGroupWaitBits(
                    xNetworkEventGroup,
                    CORE_MQTT_AGENT_DISCONNECTED_BIT,
                    pdFALSE,
                    pdFALSE,
                    0 );
                if( xReturnedEventBits_2 != 5 )
                {
                    ESP_LOGW( "TAG",
                              "L777 xEventGroupWaitBits: %lx",
                              xReturnedEventBits_2 );
                }
                if( xReturnedEventBits_2 == CORE_MQTT_AGENT_DISCONNECTED_BIT )
                {
                    break;
                }
                fd_set readSet;
                fd_set errorSet;

And here are the logs when issue appear:

2024-07-25 12:57:06 I (3959356) gate_control: Unsubscribe 1777 for topic filter /gates/GA97DF0006 succeeded for task "GateReport".
I (3959356) gate_control: Task "GateReport" iteration 591 completed.

2024-07-25 12:57:10 I (3962626) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0006 with id 1778

2024-07-25 12:57:15 E (3967626) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (3967626) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (3967626) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (3967636) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (3967636) core_mqtt_agent_manager: L533 Set disconnect bit.
I (3967646) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (3967666) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (3967656) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1740

I (3967676) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
W (3967646) TAG: L777 xEventGroupWaitBits: 9
I (3967686) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
2024-07-25 12:57:15 W (3967766) TAG: L777 xEventGroupWaitBits: 9
W (3967836) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15 W (3967906) TAG: L777 xEventGroupWaitBits: 9
W (3967976) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15 W (3968046) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15 W (3968116) TAG: L777 xEventGroupWaitBits: 9
W (3968146) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (3968146) AWS_OTA: OTA Agent is suspended.
I (3968146) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
W (3968186) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15 W (3968256) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15 W (3968326) TAG: L777 xEventGroupWaitBits: 9
W (3968396) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:20 I (3973536) gate_control: Task "GateControl" iteration 65 completed.
2024-07-25 12:57:25 W (3978446) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:35 W (3988496) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:45 W (3998546) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:56 W (4008596) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:58:06 W (4018646) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:58:16 W (4028696) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:58:17 E (4030026) supervisor: GateControl task i
2024-07-25 12:58:17 s not running

So even if xEventGroupWaitBits: 5 changed to xEventGroupWaitBits: 9 the condition

              if( xReturnedEventBits_2 == CORE_MQTT_AGENT_DISCONNECTED_BIT )
                {
                    break;
                }

didn't break the loop which seems to be reasonable as we have defined:

#define WIFI_CONNECTED_BIT               ( 1 << 0 )
#define WIFI_DISCONNECTED_BIT            ( 1 << 1 )
#define CORE_MQTT_AGENT_CONNECTED_BIT    ( 1 << 2 )
#define CORE_MQTT_AGENT_DISCONNECTED_BIT ( 1 << 3 )

So 5 mean WIFI_CONNECTED (1) and CORE_MQTT_AGENT_CONNECTED (4), isn't it? Than flipping bits 2 & 3:

                        xEventGroupClearBits( xNetworkEventGroup,
                                              CORE_MQTT_AGENT_CONNECTED_BIT );
                        xEventGroupSetBits( xNetworkEventGroup,
                                            CORE_MQTT_AGENT_DISCONNECTED_BIT );

we get 9 as CORE_MQTT_AGENT_DISCONNECTED_BIT (8) and WIFI_CONNECTED_BIT(1) is still enabled. This mean that the if condition to beak the loop should be something like that:

                if( ( xReturnedEventBits_2 &
                      CORE_MQTT_AGENT_DISCONNECTED_BIT ) == 8 )
                {
                    break;
                }

or

                if( ( xReturnedEventBits_2 &
                      CORE_MQTT_AGENT_DISCONNECTED_BIT ) != 0 )
                {
                    break;
                }

than L777 should look something more like this:

while( (xEventGroupWaitBits( xNetworkEventGroup, CORE_MQTT_AGENT_DISCONNECTED_BIT, pdFALSE, pdFALSE, 0 )& CORE_MQTT_AGENT_DISCONNECTED_BIT) != 0 )

I'm not very experienced with bitwise operations, please correct me if I'm wrong.

NightSkySK commented 4 months ago

Once I implemented this change:

               if( ( xReturnedEventBits_2 &
                    CORE_MQTT_AGENT_DISCONNECTED_BIT ) != 0 )
               {
                  break;
               }

Now it looks much better:

2024-07-25 17:54:04 I (13997266) gate_control: Task "GateControl" iteration 232 completed.

2024-07-25 17:54:04 I (13997786) coreMQTT: Publishing message to /gates/GA97DF0005.

2024-07-25 17:54:11 E (14004786) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (14004786) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (14004786) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (14004796) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (14004796) core_mqtt_agent_manager: L533 Set disconnect bit.
I (14004806) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (14004816) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1880

I (14004826) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (14004816) core_mqtt_agent_manager: L777 xEventGroupWaitBits: 9
I (14004846) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
W (14004846) core_mqtt_agent_manager: L712 xEventGroupWaitBits: 9
I (14004856) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (14004866) core_mqtt_agent_manager: TLS connection was disconnected.

2024-07-25 17:54:11 W (14004976) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (14004976) AWS_OTA: OTA Agent is suspended.
I (14004976) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]

2024-07-25 17:54:14 I (14007516) coreMQTT: MQTT connection established with the broker.
I (14007526) core_mqtt_agent_manager: Session present: 0

I (14007526) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005/update will be attempted.
W (14007526) gate_control: Error or timed out waiting for ack for publish message 6271. Re-attempting publish.
I (14007536) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005 will be attempted.
W (14007556) core_mqtt_agent_manager: L772 Clear disconnect bit 
I (14007556) report: coreMQTT-Agent connected.
W (14007556) core_mqtt_agent_manager: L509 xEventGroupWaitBits: 5
I (14007566) gate_control: coreMQTT-Agent connected.
I (14007576) supervisor: coreMQTT-Agent connected.
I (14007586) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (14007576) gate_control: Task "GateReport" sending publish request to coreMQTT-Agent with message "{"openReceived": 0,"gateSensors":{ "IN1": 0, "IN2": 0},"keyboard":{ "new_key": 0, "key": ""}, "iteration": 2088}" on topic "/gates/GA97DF0005" with ID 6271.
I (14007596) core_mqtt_agent_manager: coreMQTT-Agent connected.
I (14007616) gate_control: Task "GateReport" waiting for publish 6271 to complete.

2024-07-25 17:54:14 I (14007976) AWS_OTA: otaPal_GetPlatformImageState
I (14007976) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (14007976) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (14007976) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]

2024-07-25 17:54:16 I (14009036) coreMQTT: Publishing message to /gates/GA97DF0005.
JasonYan324 commented 4 months ago

Maybe the MQTT_PINGRESP_TIMEOUT_MS is too short for weaker network conditions? To investigate this issue I'm running simultaneously 6 devices, of which 3 are connected to IoT WiFI network where the router is supplied with 300Mbit/s fibre internet and there I don't see that MQTTKeepAliveTimeout was so frequently or even at all, much worse situation is with three other devices which are connected to TP-LINK Archer MR200 4G LTE where internet is taken from cellular LTE network. I haven't compared the pings in those networks, but I can imagine that LTE one has higher pings than the fibre one.

This is definitely possible. You can try enlarge it as workaround.

Can I ask MQTT cloud reconnect itself when wifi reconnect?In my test I foud that when wifi disconnect and reconnect itself,MQTT client can not reconnect itself. here are some log.

I (36611157) wifi:bcn_timeout,ap_probe_send_start [0;32mI (36611317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m [0;31mE (36612037) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout[0m [0;31mE (36612037) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout[0m [0;31mE (36612037) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout [0m [0;32mI (36612037) core_mqtt_agent_manager: coreMQTT-Agent disconnected.[0m [0;32mI (36612037) sub_pub_unsub_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.[0m [0;32mI (36612037) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent.[0m [0;32mI (36612037) temp_sub_pub_and_led_control_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.[0m [0;33mW (36612317) AWS_OTA: OTA Timer handle NULL for Timerid=0, can't stop.[0m [0;33mW (36612317) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.[0m [0;32mI (36612317) AWS_OTA: OTA Agent is suspended.[0m [0;32mI (36612317) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended][0m [0;32mI (36612317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m [0;32mI (36613317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m I (36613657) wifi:ap_probe_send over, resett wifi status to disassoc I (36613657) wifi:state: run -> init (c800) I (36613657) wifi:pm stop, total sleep time: 31072080977 us / 36610841828 us I (36613657) wifi:idx:1, tid:0 I (36613657) wifi:idx:0, tid:4 I (36613657) wifi:new:<6,0>, old:<6,0>, ap:<255,255>, sta:<6,0>, prof:1 [0;32mI (36613667) core_mqtt_agent_manager: WiFi disconnected.[0m [0;32mI (36613667) app_wifi: Disconnected. Connecting to the AP again...[0m [0;32mI (36614317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m [0;32mI (36615317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m I (36616077) wifi:new:<1,0>, old:<6,0>, ap:<255,255>, sta:<1,0>, prof:1 I (36616077) wifi:state: init -> auth (b0) I (36616077) wifi:state: auth -> assoc (0) I (36616107) wifi:state: assoc -> run (10) I (36616117) wifi:connected with test123, aid = 17, channel 1, BW20, bssid = f6:84:8d:1b:33:62 I (36616117) wifi:security: WPA2-PSK, phy: bgn, rssi: -67 I (36616137) wifi:pm start, type: 1 I (36616137) wifi:set rx beacon pti, rx_bcn_pti: 14, bcn_timeout: 25000, mt_pti: 14, mt_time: 10000 I (36616167) wifi:AP's beacon interval = 102400 us, DTIM period = 1 I (36616177) wifi:idx:0 (ifx:0, f6:84:8d:1b:33:62), tid:4, ssn:2, winSize:64 [0;32mI (36616317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m [0;32mI (36617137) core_mqtt_agent_manager: WiFi connected.[0m [0;32mI (36617137) app_wifi: Connected with IP Address:192.168.2.158[0m [0;32mI (36617137) esp_netif_handlers: sta ip: 192.168.2.158, mask: 255.255.255.0, gw: 192.168.2.1[0m [0;32mI (36617317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m [0;31mE (36713767) coreMQTT: No command structure available.[0m [0;32mI (36714317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m

NightSkySK commented 4 months ago

@JasonYan324 it all depends which version of main/networking/mqtt/core_mqtt_agent_manager.c file you are using, If your version is up to date with main branch than it's a case which we investigate here. My proposal for fix is make change in L777 from xEventGroupWaitBits to following code:

while( (xEventGroupWaitBits( xNetworkEventGroup, CORE_MQTT_AGENT_DISCONNECTED_BIT, pdFALSE, pdFALSE, 0 )& CORE_MQTT_AGENT_DISCONNECTED_BIT) != 0 )

There is quite big chance that it will help with your case as well. Please let us know if that works for you as I'm about to prepare the Pull Request with that solution

JasonYan324 commented 4 months ago

it all depends which version of main/networking/mqtt/core_mqtt_agent_manager.c file you are using, If your version is up to date with main branch than it's a case which we investigate here. My proposal for fix is make change in L777 from xEventGroupWaitBits to following code:

my SDK version is 5.2.2, not the new. Today, I saw the SDK is updtate to 5.3.

one more thing, even if wifi does not diconnect, MQTT keep alive failed and disconnect, but can not be reconnect itselt.

here are some log:

I (5357267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5357297) sub_pub_unsub_demo: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /Meshed/Jason/404CCA89C184/Filter/SubPub0 with id 247 I (5358267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5358837) coreMQTT: Ack packet deserialized with result: MQTTSuccess. I (5358847) coreMQTT: State record updated. New state=MQTTPublishDone. I (5359267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5360267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5361267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5362267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5363267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 E (5363847) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout E (5363847) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout E (5363847) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout  I (5363847) core_mqtt_agent_manager: coreMQTT-Agent disconnected. I (5363847) sub_pub_unsub_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued. I (5363847) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent. I (5363847) temp_sub_pub_and_led_control_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued. W (5364267) AWS_OTA: OTA Timer handle NULL for Timerid=0, can't stop. W (5364267) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop. I (5364267) AWS_OTA: OTA Agent is suspended. I (5364267) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended] I (5452267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5453267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5454267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5455267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5456267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5457267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5458267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5459267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 E (5460097) coreMQTT: No command structure available. I (5478267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5479267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 E (5480117) coreMQTT: No command structure available. I (5487267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5488267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5489267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 E (5490127) coreMQTT: No command structure available. I (5498267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5499267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 E (5500137) coreMQTT: No command structure available. I (5509267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 E (5510147) coreMQTT: No command structure available. I (5510267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 I (5511267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0

NightSkySK commented 4 months ago

I see the solution is merged with this commit https://github.com/FreeRTOS/iot-reference-esp32c3/commit/36492fc017089ff3f03f9513374f842948fdfdf1

JasonYan324 commented 4 months ago

I see the solution is merged with this commit 36492fc

It seems, the commit 36492fc have a lot of problems

NightSkySK commented 4 months ago

as I mentioned in first e-mail I've applied changes from following commits: https://github.com/FreeRTOS/iot-reference-esp32c3/commit/a0fe220e48d274e4b724d50109f45552fa98a58d referring to sdkconfig.default and https://github.com/FreeRTOS/iot-reference-esp32c3/commit/e2d407f72c20202730e7ac641506559fc36eee43 referring to main/networking/mqtt/core_mqtt_agent_manager.c mainly to fix esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00 which was clear indication mbedtls memory leak. In the mean time we found that coreMQTT can't reestablish a connection after error (e.g. due MQTTKeepAliveTimeout) which was solved with 36492fc . How ever after all those changes I find out that esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00 still appear again after about 11 hours running the program.

2024-07-29 08:24:04 E (39396617) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (39396617) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (39396617) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (39396627) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (39396627) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1816

I (39396637) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (39396627) core_mqtt_agent_manager: TLS connection was disconnected.
I (39396657) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (39396667) core_mqtt_agent_manager: Internal heap size: 57260bytes
I (39396677) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
W (39396677) core_mqtt_agent_manager: Minimum free heap size: 11924 bytes
I (39396687) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
W (39396737) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (39396737) AWS_OTA: OTA Agent is suspended.
I (39396737) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]

2024-07-29 08:24:04 E (39397197) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39397197) esp-tls: create_ssl_handle failed
E (39397197) esp-tls: Failed to open new connection

2024-07-29 08:24:04 I (39397517) core_mqtt_agent_manager: Retry attempt 1.

2024-07-29 08:24:05 E (39397707) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39397717) esp-tls: create_ssl_handle failed
E (39397717) esp-tls: Failed to open new connection
I (39398167) core_mqtt_agent_manager: Retry attempt 2.

2024-07-29 08:24:05 E (39398427) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39398427) esp-tls: create_ssl_handle failed
E (39398427) esp-tls: Failed to open new connection

2024-07-29 08:24:07 I (39399907) core_mqtt_agent_manager: Retry attempt 3.

2024-07-29 08:24:07 E (39400077) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39400077) esp-tls: create_ssl_handle failed
E (39400077) esp-tls: Failed to open new connection

2024-07-29 08:24:09 I (39401837) core_mqtt_agent_manager: Retry attempt 4.
E (39402017) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39402017) esp-tls: create_ssl_handle failed
E (39402017) esp-tls: Failed to open new connection

2024-07-29 08:24:11 I (39404217) core_mqtt_agent_manager: Retry attempt 5.

2024-07-29 08:24:11 E (39404387) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39404387) esp-tls: create_ssl_handle failed
E (39404387) esp-tls: Failed to open new connection

2024-07-29 08:24:16 I (39409227) core_mqtt_agent_manager: Retry attempt 6.

2024-07-29 08:24:16 E (39409417) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39409427) esp-tls: create_ssl_handle failed
E (39409427) esp-tls: Failed to open new connection

2024-07-29 08:24:17 I (39410127) core_mqtt_agent_manager: Retry attempt 7.

2024-07-29 08:24:17 E (39410307) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39410307) esp-tls: create_ssl_handle failed
E (39410307) esp-tls: Failed to open new connection

2024-07-29 08:24:17 E (39410527) supervisor: Connection can't be established

Could you please advice any more steps to debug reasons of memory leak causing finally to esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00?

ActoryOu commented 4 months ago

Hello @NightSkySK, From your log, you've added extra info like core_mqtt_agent_manager: Internal heap size: 57260bytes and Minimum free heap size: 11924 bytes. Could you help attach the entire log? I'd like to narrow down the scenario.

Thank you.

ActoryOu commented 4 months ago

In the meantime, I will try to perform a code review to see if there are any clues. BTW, what change did you do? Could you help share your testing branch?

ActoryOu commented 4 months ago

Hi @NightSkySK, I'm still having trouble identifying the potential cause of the heap memory leakage based on the information provided. Could you please share the demo configuration you're using? Have you made any modifications to the demo, or are you using the demo with only configuration changes? That additional context would be helpful in assisting you with this issue.

Thank you for your cooperation.

NightSkySK commented 4 months ago

@ActoryOu This week I'm travelling and not able to provide more evidence. I need to include more logs like Minimum free heap size: into the code to trace it in the time of code running, so far I have it only in this one place. I can invite you to my repo however it's already quite heavily modified to adjust to my needs.

WilliamFrasson commented 3 months ago

Hello I face the same issues with my ESP32-S3

System information Hardware board: ESP32-S3 (ESP32 S3 DevKitC 1 N16R8 ESP32 S3 WROOM1 N16R8) ESP-IDF: 5.2.2 ESP-IDE 3.0.0 Operating System: Windows

I (232435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (233435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (234435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (235435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (236115) main: Send TOPIC: ESP32S3_TEST1/out/memcheck MESSAGE: {"xxxxx"}
I (236135) MQTT_UTILS: Sending publish request Topic: "ESP32S3_TEST1/out/memcheck": "{"xxxx"}"  with ID 24.
I (236155) coreMQTT: Publishing message to ESP32S3_TEST1/out/memcheck.

I (236165) MQTT_UTILS: Task "APPTASK" waiting for publish 24 to complete.
I (236175) MQTT_UTILS: Publish 24 succeeded for task "APPTASK".
I (236435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0

E (237165) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (237165) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (237165) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (237175) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent.
I (237185) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (237185) MQTT_UTILS: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (237435) ota_over_mqtt_demo: Call OTA_Suspend...
W (237435) AWS_OTA: OTA Timer handle NULL for Timerid=0, can't stop.
W (237435) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (237435) AWS_OTA: OTA Agent is suspended.
I (237445) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
I (237445) ota_over_mqtt_demo: Call OTA_Suspend...OK
I (237455) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (238465) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (239465) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
JasonYan324 commented 3 months ago

Hello I face the same issues with my ESP32-S3

System information Hardware board: ESP32-S3 (ESP32 S3 DevKitC 1 N16R8 ESP32 S3 WROOM1 N16R8) ESP-IDF: 5.2.2 ESP-IDE 3.0.0 Operating System: Windows

I (232435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (233435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (234435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (235435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (236115) main: Send TOPIC: ESP32S3_TEST1/out/memcheck MESSAGE: {"xxxxx"}
I (236135) MQTT_UTILS: Sending publish request Topic: "ESP32S3_TEST1/out/memcheck": "{"xxxx"}"  with ID 24.
I (236155) coreMQTT: Publishing message to ESP32S3_TEST1/out/memcheck.

I (236165) MQTT_UTILS: Task "APPTASK" waiting for publish 24 to complete.
I (236175) MQTT_UTILS: Publish 24 succeeded for task "APPTASK".
I (236435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0

E (237165) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (237165) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (237165) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (237175) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent.
I (237185) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (237185) MQTT_UTILS: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (237435) ota_over_mqtt_demo: Call OTA_Suspend...
W (237435) AWS_OTA: OTA Timer handle NULL for Timerid=0, can't stop.
W (237435) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (237435) AWS_OTA: OTA Agent is suspended.
I (237445) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
I (237445) ota_over_mqtt_demo: Call OTA_Suspend...OK
I (237455) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (238465) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (239465) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0

you can update the git source code,them fix this issue, but “keep alive failed” will still happens,what is new is that coreMQTT-Agent can be reconnect itself. But, even worse: in my test ota_over_mqtt_demo can not run, it will cause reboot, I have to close it for test up to now. I still waiting for them to feedback

WilliamFrasson commented 3 months ago

Hello, update source code and as you already said

"Handling of keep alive failed. Status=MQTTKeepAliveTimeout" still happen but now seem that ESP32-s3 reset as you can see in the firts lines of logs here below:

I (1099585) main: Send TOPIC: ESP32S3_TEST1/out/memcheck MESSAGE: {"xxxxxx"}
I (1099605) MQTT_UTILS: Sending publish request Topic: "ESP32S3_TEST1/out/memcheck": "xxxxxx"  with ID 139.
I (1099635) MQTT_UTILS: Task "APPTASK" waiting for publish 139 to complete.
I (1101275) MQTT_UTILS: Publish 139 succeeded for task "APPTASK".

E (1107325) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (1107325) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout

ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0x8 (TG1WDT_SYS_RST),boot:0x8 (SPI_FAST_FLASH_BOOT)
Saved PC:0x42004d7b
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3810,len:0x11b4
load:0x403c9700,len:0x4
load:0x403c9704,len:0xab4
load:0x403cc700,len:0x2c14
entry 0x403c9890
I (261) cpu_start: Multicore app
I (271) cpu_start: Pro cpu start user code
I (271) cpu_start: cpu freq: 160000000 Hz
I (271) cpu_start: Application information:
I (274) cpu_start: Project name:     smartmeter
I (279) cpu_start: App version:      v202212.00-64-g06f5ce1-dirty
I (286) cpu_start: Compile time:     Aug  1 2024 08:04:32
I (292) cpu_start: ELF file SHA256:  57fe881d4...
I (297) cpu_start: ESP-IDF:          v5.2.2-dirty
I (303) cpu_start: Min chip rev:     v0.0
I (307) cpu_start: Max chip rev:     v0.99
I (312) cpu_start: Chip rev:         v0.2
I (317) heap_init: Initializing. RAM available for dynamic allocation:
I (324) heap_init: At 3FCABFE8 len 0003D728 (245 KiB): RAM
I (330) heap_init: At 3FCE9710 len 00005724 (21 KiB): RAM
I (336) heap_init: At 3FCF0000 len 00008000 (32 KiB): DRAM
I (343) heap_init: At 600FE014 len 00001FD4 (7 KiB): RTCRAM
I (350) spi_flash: detected chip: generic
I (354) spi_flash: flash io: dio
W (357) spi_flash: Detected size(16384k) larger than the size in the binary image header(4096k). Using the size in the binary image header.
W (371) mcpwm(legacy): legacy driver is deprecated, please migrate to `driver/mcpwm_prelude.h`
I (380) sleep: Configure to isolate all GPIO pins in sleep state
I (387) sleep: Enable automatic switching of GPIO sleep configuration
I (394) coexist: coex firmware version: d96c1e51f
I (399) coexist: coexist rom version e7ae62f
I (405) main_task: Started on CPU0
I (415) main_task: Calling app_main()
I (415) main: ------ APPLICATION START ----------
I (415) main:
EWC_S3 WFR ESP-IDF Version = v5.2.2-dirty

I (425) main: Get Device Credentil From FLASH Initial partition. OK!
I (435) main: xDeviceInitialSettings:
I (435) main:   cDeviceSerial[sizeDeviceSerial]=ESP32S3_TEST1
I (445) main:   cUdpPin[sizeUdpPin]=1234
I (445) main:   cXorKey[sizeXorKey]=DEMO_KEY
I (455) main:   cThingName[sizeThingName]=ESP32S3_TEST1
I (455) main:   cCertificate[sizeCertificate]=-----BEGIN CERTIFICATE-----
...

and secondly, already as you said: ota_over_mqtt_demo can not run, : with "with error code 3."


I (2195) ota_over_mqtt_demo: OTA over MQTT demo, Application version 0.0.0
I (2205) main: START APPLICATION EWC
I (2205) main: ------ APPLICATION START ----------
...
I (9485) core_mqtt_agent_manager: TLS connection established.
I (10095) coreMQTT: MQTT connection established with the broker.
I (10095) core_mqtt_agent_manager: Session present: 0

I (10095) ota_over_mqtt_demo: coreMQTT-Agent connected. Resuming OTA agent.
I (10105) core_mqtt_agent_manager: coreMQTT-Agent connected.
I (10105) MQTT_UTILS: coreMQTT-Agent connected.
I (10185) ota_over_mqtt_demo: Request Job Document event Received

E (10185) ota_over_mqtt_demo: Failed to write job start next topic to buffer with error code 3.
WilliamFrasson commented 3 months ago

Hello, just un update, after I have excluded the ota_over_mqtt_demo from my code and all seem work without any error from the 16 hours, without any event of MQTTKeepAliveTimeout.

ActoryOu commented 3 months ago

Hi @WilliamFrasson, We recently fixed an issue on OTA demo via #103. Could you help pull latest version and test if same issue exists?

Thank you.

WilliamFrasson commented 3 months ago

Hi @ActoryOu I enabled the ota demo test with the ota fix, and now running .. after 4 hours i don't get MQTTKeepAliveTimeout events.

kar-rahul-aws commented 3 months ago

Hi @JasonYan324 Since you were also facing a similar issue , can you confirm if the https://github.com/FreeRTOS/iot-reference-esp32c3/pull/103 change works for you as well?

JasonYan324 commented 3 months ago

Hi @JasonYan324 Since you were also facing a similar issue , can you confirm if the #103 change works for you as well?

Thanks,I have already update for test, and gonging well

ActoryOu commented 3 months ago

I'm closing this because the patch works for multiple people. @NightSkySK, please feel free to create a new issue to describe your current status if the patch doesn't work.

Thank you.