[BUG] MQTTAgent_CommandLoop not return mqtt disconnected when internet disconnect while wifi still connect

samuelbles07 commented 2 years ago

Describe the bug Please provide a clear and concise description explaining the bug.

System information

Hardware board: ESP32 devkit C
IDE used: Visual Studio Code
Operating System: MacOS
Code version: 202107.00 and 202203.00
Project/Demo: CONFIG_CORE_MQTT_AGENT_DEMO_ENABLED

Expected behavior MQTTAgent_CommandLoop should return disconnected or any error event in mqtt_agent_task.c when internet disconnected while wifi still connected, just like demo CONFIG_CORE_MQTT_MUTUAL_AUTH_DEMO_ENABLED .

Screenshots or console output

From Publish 5 until 53, publish message still return success publishing message. Even though internet is disconnected around 3.5minutes. Logs like below

300 28150 [SubPub0] [simple_sub_pub_demo.c:520] [INFO] [MQTTDemo] Sending publish request to agent with message "Publisher0 publishing message 53" on topic "/filter/Publisher0"

301 28151 [SubPub0] [simple_sub_pub_demo.c:536] [INFO] [MQTTDemo] Task Publisher0 waiting for publish 53 to complete.

302 28152 [MQTT Agent] [core_mqtt_agent_command_functions.c:76] [INFO] [MQTT] Publishing message to /filter/Publisher0.

303 28152 [SubPub0] [simple_sub_pub_demo.c:551] [INFO] [MQTTDemo] Publish 53 completed successfully.

304 28154 [SubPub0] [simple_sub_pub_demo.c:554] [INFO] [MQTTDemo] Short delay before next publish...

But when i turn internet back on, then it notice that publishing message failed. Logs like below

305 28655 [SubPub0] [simple_sub_pub_demo.c:520] [INFO] [MQTTDemo] Sending publish request to agent with message "Publisher0 publishing message 54" on topic "/filter/Publisher0"

306 28656 [MQTT Agent] [core_mqtt_agent_command_functions.c:76] [INFO] [MQTT] Publishing message to /filter/Publisher0.

307 28657 [MQTT Agent] [transport_secure_sockets.c:158] [ERROR] [Transport_Secure_Sockets] Failed to send data over network. bytesSent=-1.

308 28657 [MQTT Agent] [core_mqtt.c:625] [ERROR] [MQTT] Transport send failed. Error code=-1.

309 28658 [MQTT Agent] [core_mqtt.c:1414] [ERROR] [MQTT] Transport send failed for PUBLISH header.

310 28659 [MQTT Agent] [core_mqtt.c:1973] [ERROR] [MQTT] MQTT PUBLISH failed with status MQTTSendFailed.

311 28660 [MQTT Agent] [core_mqtt_agent.c:1023] [ERROR] [MQTT] MQTT operation failed with status MQTTSendFailed

312 28661 [MQTT Agent] [mqtt_agent_task.c:800] [INFO] [MQTTDemo] Disconnecting TLS connection.

313 28662 [MQTT Agent] [transport_secure_sockets.c:562] [ERROR] [Transport_Secure_Sockets] Failed to close connection: SOCKETS_Shutdown call failed. -1

314 28663 [MQTT Agent] [mqtt_agent_task.c:885] [ERROR] [MQTTDemo] Could not reconnect to MQTT broker

315 28664 [MQTT Agent] [mqtt_agent_task.c:892] [INFO] [MQTTDemo] MQTT Agent task completed.

I keep demo code running, and i notice publishing message still return success, even though MQTT agent is already stopped.

317 28667 [SubPub0] [simple_sub_pub_demo.c:551] [INFO] [MQTTDemo] Publish 54 completed successfully.

318 28668 [SubPub0] [simple_sub_pub_demo.c:554] [INFO] [MQTTDemo] Short delay before next publish... 

319 29169 [SubPub0] [simple_sub_pub_demo.c:520] [INFO] [MQTTDemo] Sending publish request to agent with message "Publisher0 publishing message 55" on topic "/filter/Publisher0"

320 29170 [SubPub0] [simple_sub_pub_demo.c:536] [INFO] [MQTTDemo] Task Publisher0 waiting for publish 55 to complete.

321 30171 [SubPub0] [simple_sub_pub_demo.c:554] [INFO] [MQTTDemo] Short delay before next publish...

Steps to reproduce bug Just run it normally. I only change these line below so demo run longer:

#define mqttexampleDELAY_BETWEEN_PUBLISH_OPERATIONS_MS ( 5000U )
#define mqttexamplePUBLISH_COUNT ( 100UL )

Additional context

For testing, i am using tethring/share wifi from my android phone and turn on/off mobile data
Also, i test both release 202107.00 and 202203.00
In my real application (not demo), i am using 202107.00 and when i reproduce the bug, i see that message is not actually published, but still in the queue.

247 38345 [MQTT Agent] [core_mqtt_serializer.c:571] [ERROR] [MQTT] PUBLISH payload length of 1073538064 cannot exceed 268424674 so as not to exceed the maximum remaining length of MQTT 3.1.1 248 38346 [MQTT Agent] [core_mqtt_serializer.c:1871] [ERROR] [MQTT] PUBLISH packet remaining length exceeds 268435455, which is the maximum size allowed by MQTT 3.1.1.

Thank you!

ravibhagavandas commented 2 years ago

Hi @samuelbles07 Apologies for a delay in response. From the logs it seems that the while the link between the client and server is broken, the underlying TCP/IP stack does not detect the network connection close fast enough. I assume underlying TCP/IP stack used is LWIP here. If so you can optionally enable LWIP TCP Keep Alive mechanisms to detect lost connection. See here for more documentation on this. MQTT application level Keep Alive is not triggered here since the library is still able to send a packet out successfully over the TCP/IP connection.

samuelbles07 commented 2 years ago

Hi @ravibhagavandas ,

I see that LWIP_TCP_KEEPALIVE already enabled, i try to print it in first line at app_main(void).

    /* Perform any hardware initialization that does not require the RTOS to be
     * running.  */

    prvMiscInitialization();
    configPRINTF(("LWIP_TCP_KEEPALIVE: %d", LWIP_TCP_KEEPALIVE));

Output

0 0 [main] Initializing lwIP TCP stack

1 1 [main] LWIP_TCP_KEEPALIVE: 1

. . Btw, kind of out of topic. Is this repository still actively maintained? Because i see that the last couple of weeks there's no activity.

Although, thanks for the reply!

aggarg commented 2 years ago

MQTT_Agent the calls the coreMQTT's MQTT_Publish API to publish a message. This API in-turn calls the transport interface's send function which in this case is SecureSocketsTransport_Send. SecureSocketsTransport_Send calls SOCKETS_Send which eventually calls prvNetworkSend.

Can you put a log or breakpoint at the call to lwip_send in the prvNetworkSend function and check it's return value when you disconnect the network? My suspicion is that the LWIP stack takes some time in determining that the underlying connection is lost and keeps returning success till then. As a result, the MQTT stack wrongly thinks that it is able to publish message successfully. If it turns out to be true, the next step would be to determine if LWIP stack can be configured to detect disconnections faster.

samuelbles07 commented 2 years ago

Hi @aggarg

How i test is by send a heartbeat data every 60s and see that return value when successfully send data is 60 & 48

94 6009 [MQTT Agent] lwip_send return value: 60
95 6010 [MQTT Agent] lwip_send return value: 48

So after 2 expected success send data (around 2 minutes), i turn off mobile data from my phone (internet), then i wait until 20 minute, and lwip_send still return 60 and 40 (20 attempt publish). Then i try to turn on mobile data, suddenly it realize it is disconnected, that i think it is from keep alive function.

503 133184 [MQTT Agent] lwip_send return value: -1
504 133184 [MQTT Agent] [transport_secure_sockets.c:208] [ERROR] [Transport_Secure_Sockets] Failed to receive data over network. bytesReceived=-1

505 133184 [MQTT Agent] [core_mqtt_serializer.c:2396] [ERROR] [MQTT] A single byte was not read from the transport: transportStatus=-1.

506 133186 [MQTT Agent] [core_mqtt.c:1320] [ERROR] [MQTT] Receiving incoming packet length failed. Status=MQTTRecvFailed

507 133187 [MQTT Agent] [core_mqtt.c:2192] [ERROR] [MQTT] Exiting process loop due to failure: ErrorStatus=MQTTRecvFailed

508 133188 [MQTT Agent] [core_mqtt_agent.c:1023] [ERROR] [MQTT] MQTT operation failed with status MQTTRecvFailed

510 133190 [MQTT Agent] [transport_secure_sockets.c:562] [ERROR] [Transport_Secure_Sockets] Failed to close connection: SOCKETS_Shutdown call failed. -1

I am wondering, isn't keep alive should handle something like that? when client app send PINGREQ and wait for PINGRESP from broker?

What makes it confusing is if i try demo CONFIG_CORE_MQTT_MUTUAL_AUTH_DEMO_ENABLED, It works as expected. 1.5 time from keep alive value, app then get disconnected event.

aggarg commented 2 years ago

I am wondering, isn't keep alive should handle something like that? when client app send PINGREQ and wait for PINGRESP from broker?

Yes, it should.

MQTT_Agent calls coreMQTT's MQTT_ProcessLoop API periodically which handles the keep alive. MQTT_ProcessLoop calls MQTT_GetIncomingPacketTypeAndLength which in-turn calls the transport interface's receive function which in this case is SecureSocketsTransport_Recv. SecureSocketsTransport_Recv calls SOCKETS_Recv which in-turn calls prvNetworkRecv.

Can you put a log or breakpoint at the call to lwip_recv function to check it's return value when you disconnect the network?

aggarg commented 2 years ago

@samuelbles07 Did you get a chance to do that?

samuelbles07 commented 2 years ago

Hi, sorry for the late reply

Below is the result:

309 12027 [SubPub0] [simple_sub_pub_demo.c:522] [INFO] [MQTTDemo] Sending publish request to agent with message "Publisher0 publishing message 11" on topic "/filter/Publisher0"

310 12028 [SubPub0] [simple_sub_pub_demo.c:527] [INFO] [MQTTDemo] ulNotification 4294967284; ulValueToNotify 11

311 12029 [MQTT Agent] [core_mqtt_agent_command_functions.c:76] [INFO] [MQTT] Publishing message to /filter/Publisher0.

312 12030 [MQTT Agent] >>> lwip_recv(): -1
313 12030 [SubPub0] [simple_sub_pub_demo.c:539] [INFO] [MQTTDemo] Task Publisher0 waiting for publish 11 to complete.

314 12031 [SubPub0] [simple_sub_pub_demo.c:547] [INFO] [MQTTDemo] Command context return status: 0

315 12033 [SubPub0] [simple_sub_pub_demo.c:554] [INFO] [MQTTDemo] ulNotification 11; ulValueToNotify 11

316 12035 [SubPub0] [simple_sub_pub_demo.c:557] [INFO] [MQTTDemo] Publish 11 completed successfully.

317 12036 [SubPub0] [simple_sub_pub_demo.c:560] [INFO] [MQTTDemo] Short delay before next publish...

318 12131 [MQTT Agent] >>> lwip_recv(): -1

I try another things too, and i notice that

PINGREQ is not sent if application periodically publish message below keep-alive value. Is this by design?
When using QOS0 with no internet, prvWaitForCommandAcknowledgment() return pdTrue, so i know that prvPublishCommandCallback is called and publish succeeded (Expected because QOS0 doesn't wait for ACK) that define by comparing notification value.
When using QOS1 with no internet. Unlike QOS0, prvWaitForCommandAcknowledgment() receive pdFalse (maybe timeout) that i know prvPublishCommandCallback() is not called, so publish considered failed. Application able to define something happen to MQTT connection by failed publish message. But still MQTTAgent_CommandLoop() not return anything.
I try calling MQTT_Ping() manually every 60s, but when no internet, still MQTTAgent_CommandLoop not return anything. lwip_receive() not receive anything when sending PINGREQ packet when internet not available.

My application using QOS0, that's why it always return success publish, but do i really need to use QOS1? If yes, should i state the mqtt connection to disconnected when failed publish message?

Regardless, i think something need to be fixed if people decide to use QOS0.

FYI, demo config CONFIG_CORE_MQTT_MUTUAL_AUTH_DEMO_ENABLED using QOS1 and CONFIG_CORE_MQTT_AGENT_DEMO_ENABLED using QOS0.

samuelbles07 commented 2 years ago

I can fix my problem by adding flag to sendPacket() either packet will be consider last transmission time or not(Publish QOS0). If packet that will be sent expecting ACK, so it consider last transmission time.

in core_mqtt.c

sendPacket()

    /* Update time of last transmission if the entire packet is successfully sent. */
    if( totalBytesSent > 0 && updateLastPacketTime )
    {
        pContext->lastPacketTime = lastSendTimeMs;
        LogDebug( ( "Successfully sent packet at time %lu.",
                    ( unsigned long ) lastSendTimeMs ) );
    }

sendPublish()

bool updateLastPacketTime = true;
.
.
.
if ( pPublishInfo->qos == MQTTQoS0 ) {
        /*
         * Because QOS0 not wait for ACKs, it will never know if it is success or not.
         * So when application periodically publish data under keep alive interval time,
         * when internet suddenly unavailable in runtime, application will never get error event.
         */
        updateLastPacketTime = false;
    }

    /* Send header first. */
    bytesSent = sendPacket( pContext,
                            pContext->networkBuffer.pBuffer,
                            headerSize,
                            updateLastPacketTime );
.
.
.
bytesSent = sendPacket( pContext,
                                    pPublishInfo->pPayload,
                                    pPublishInfo->payloadLength,
                                    updateLastPacketTime );

aggarg commented 2 years ago

Thank you for investigating. So the MQTT PING mechanism is able to detect the broken connection but it was not getting triggered as the connection was not determined to be idle because of QoS0 publishes.

I have a question - From the logs, you shared before -

318 12131 [MQTT Agent] >>> lwip_recv(): -1

The loop should break when LWIP returns error:

`MQTTAgent_CommandLoop --> processCommand --> MQTT_ProcessLoop --> receiveSingleIteration --> MQTT_GetIncomingPacketTypeAndLength`

Can you trace why that is not happening? Also, I am looking into your proposed solution.

aggarg commented 2 years ago

Also, can you try this and let me know if it works for you - https://github.com/aggarg/coreMQTT/tree/keep_alive

samuelbles07 commented 2 years ago

No problem.

I put log in:

lwip_recv return value
SOCKETS_Recv return value
readFunc return value
MQTT_GetIncomingPacketTypeAndLength return value
receivePacket return value
receiveSingleIteration loop break
MQTT_ProcessLoop loop break

Logs

209 1573 [MQTT Agent] >>> lwip_recv(): -1
210 1573 [MQTT Agent] [transport_secure_sockets.c:200] [INFO] [Transport_Secure_Sockets] SOCKETS_Recv: 0

211 1573 [MQTT Agent] [core_mqtt_serializer.c:2364] [INFO] [MQTT] bytesReceived: 0

212 1574 [MQTT Agent] [core_mqtt.c:1301] [INFO] [MQTT] MQTT_GetIncomingPacketTypeAndLength: MQTTNoDataAvailable

214 1575 [MQTT Agent] [core_mqtt.c:1332] [INFO] [MQTT] receivePacket: MQTTNoDataAvailable

215 1576 [MQTT Agent] [core_mqtt.c:2218] [INFO] [MQTT] Exit from receiveSingleIteration loop. Status: MQTTSuccess

I think the problem is in here, it goes inside the statement, so the flow thinks everythings is okay.

samuelbles07 commented 2 years ago

Also, can you try this and let me know if it works for you - https://github.com/aggarg/coreMQTT/tree/keep_alive

Yes your solutions works

samuelbles07 commented 2 years ago

Related to why lwip_recv: -1, that maybe different issue.

I had an error that a couple times occured. This error always print out forever and not print others, it's like something is stuck that make all tasks stopped from scheduling. But i can't replicate it.

aggarg commented 2 years ago

I think the problem is in here, it goes inside the statement, so the flow thinks everythings is okay.

What is the value of errno? Enabling LWIP debug logs may also be helpful here.

samuelbles07 commented 2 years ago

I have no idea why but, errno value is 11

277 7503 [MQTT Agent] [core_mqtt.c:1072] [INFO] [MQTT] State record updated. New state=MQTTPublishDone.

278 7503 [MQTT Agent] [simple_sub_pub_demo.c:354] [INFO] [MQTTDemo] Received incoming publish message Publisher0 publishing message 3

279 7506 [MQTT Agent] >>> lwip_recv(): -1
280 7506 [MQTT Agent] >>> errno: 11

I put the logs as below

int ret = lwip_recv( ctx->ip_socket,
                        pucReceiveBuffer,
                        xReceiveLength,
                        ctx->recv_flag );
configPRINTF((">>> lwip_recv(): %d", ret));

if( -1 == ret )
{
    /*
        * 1. EWOULDBLOCK if the socket is NON-blocking, but there is no data
        *    when recv is called.
        * 2. EAGAIN if the socket would block and have waited long enough but
        *    packet is not received.
        */
    if( ( errno == EWOULDBLOCK ) || ( errno == EAGAIN ) )
    {
        configPRINTF((">>> errno: %d", errno));
        return SOCKETS_ERROR_NONE; /* timeout or would block */
    }

    /*
        * socket is not connected.
        */
    if( errno == EBADF )
    {
        return SOCKETS_ECLOSED;
    }
}

I mean, why it goes inside that statement while value of EAGAIN is 35 and EWOULDBLOCK is EAGAIN. Will enable LWIP debug later.

aggarg commented 2 years ago

I mean, why it goes inside that statement while value of EAGAIN is 35 and EWOULDBLOCK is EAGAIN.

That is certainly strange. Can you change the log line to below to confirm:

configPRINTF((">>> errno: %d, EWOULDBLOCK: %d, EAGAIN: %d", errno, EWOULDBLOCK, EAGAIN));

aggarg commented 1 year ago

I am closing this issue. Feel free to reopen or create a new one if you still face any issue.

aws / amazon-freertos

[BUG] MQTTAgent_CommandLoop not return mqtt disconnected when internet disconnect while wifi still connect #3482