Closed samuelbles07 closed 1 year ago
Hi @samuelbles07 Apologies for a delay in response. From the logs it seems that the while the link between the client and server is broken, the underlying TCP/IP stack does not detect the network connection close fast enough. I assume underlying TCP/IP stack used is LWIP here. If so you can optionally enable LWIP TCP Keep Alive mechanisms to detect lost connection. See here for more documentation on this. MQTT application level Keep Alive is not triggered here since the library is still able to send a packet out successfully over the TCP/IP connection.
Hi @ravibhagavandas ,
I see that LWIP_TCP_KEEPALIVE
already enabled, i try to print it in first line at app_main(void)
.
/* Perform any hardware initialization that does not require the RTOS to be
* running. */
prvMiscInitialization();
configPRINTF(("LWIP_TCP_KEEPALIVE: %d", LWIP_TCP_KEEPALIVE));
Output
0 0 [main] Initializing lwIP TCP stack
1 1 [main] LWIP_TCP_KEEPALIVE: 1
. . Btw, kind of out of topic. Is this repository still actively maintained? Because i see that the last couple of weeks there's no activity.
Although, thanks for the reply!
MQTT_Agent the calls the coreMQTT's MQTT_Publish API to publish a message. This API in-turn calls the transport interface's send function which in this case is SecureSocketsTransport_Send. SecureSocketsTransport_Send calls SOCKETS_Send which eventually calls prvNetworkSend.
Can you put a log or breakpoint at the call to lwip_send in the prvNetworkSend function and check it's return value when you disconnect the network? My suspicion is that the LWIP stack takes some time in determining that the underlying connection is lost and keeps returning success till then. As a result, the MQTT stack wrongly thinks that it is able to publish message successfully. If it turns out to be true, the next step would be to determine if LWIP stack can be configured to detect disconnections faster.
Hi @aggarg
How i test is by send a heartbeat data every 60s and see that return value when successfully send data is 60 & 48
94 6009 [MQTT Agent] lwip_send return value: 60
95 6010 [MQTT Agent] lwip_send return value: 48
So after 2 expected success send data (around 2 minutes), i turn off mobile data from my phone (internet), then i wait until 20 minute, and lwip_send still return 60 and 40 (20 attempt publish). Then i try to turn on mobile data, suddenly it realize it is disconnected, that i think it is from keep alive function.
503 133184 [MQTT Agent] lwip_send return value: -1
504 133184 [MQTT Agent] [transport_secure_sockets.c:208] [ERROR] [Transport_Secure_Sockets] Failed to receive data over network. bytesReceived=-1
505 133184 [MQTT Agent] [core_mqtt_serializer.c:2396] [ERROR] [MQTT] A single byte was not read from the transport: transportStatus=-1.
506 133186 [MQTT Agent] [core_mqtt.c:1320] [ERROR] [MQTT] Receiving incoming packet length failed. Status=MQTTRecvFailed
507 133187 [MQTT Agent] [core_mqtt.c:2192] [ERROR] [MQTT] Exiting process loop due to failure: ErrorStatus=MQTTRecvFailed
508 133188 [MQTT Agent] [core_mqtt_agent.c:1023] [ERROR] [MQTT] MQTT operation failed with status MQTTRecvFailed
510 133190 [MQTT Agent] [transport_secure_sockets.c:562] [ERROR] [Transport_Secure_Sockets] Failed to close connection: SOCKETS_Shutdown call failed. -1
I am wondering, isn't keep alive should handle something like that? when client app send PINGREQ and wait for PINGRESP from broker?
What makes it confusing is if i try demo CONFIG_CORE_MQTT_MUTUAL_AUTH_DEMO_ENABLED
, It works as expected. 1.5 time from keep alive value, app then get disconnected event.
I am wondering, isn't keep alive should handle something like that? when client app send PINGREQ and wait for PINGRESP from broker?
Yes, it should.
MQTT_Agent calls coreMQTT's MQTT_ProcessLoop API periodically which handles the keep alive. MQTT_ProcessLoop calls MQTT_GetIncomingPacketTypeAndLength which in-turn calls the transport interface's receive function which in this case is SecureSocketsTransport_Recv. SecureSocketsTransport_Recv calls SOCKETS_Recv which in-turn calls prvNetworkRecv.
Can you put a log or breakpoint at the call to lwip_recv function to check it's return value when you disconnect the network?
@samuelbles07 Did you get a chance to do that?
Hi, sorry for the late reply
Below is the result:
309 12027 [SubPub0] [simple_sub_pub_demo.c:522] [INFO] [MQTTDemo] Sending publish request to agent with message "Publisher0 publishing message 11" on topic "/filter/Publisher0"
310 12028 [SubPub0] [simple_sub_pub_demo.c:527] [INFO] [MQTTDemo] ulNotification 4294967284; ulValueToNotify 11
311 12029 [MQTT Agent] [core_mqtt_agent_command_functions.c:76] [INFO] [MQTT] Publishing message to /filter/Publisher0.
312 12030 [MQTT Agent] >>> lwip_recv(): -1
313 12030 [SubPub0] [simple_sub_pub_demo.c:539] [INFO] [MQTTDemo] Task Publisher0 waiting for publish 11 to complete.
314 12031 [SubPub0] [simple_sub_pub_demo.c:547] [INFO] [MQTTDemo] Command context return status: 0
315 12033 [SubPub0] [simple_sub_pub_demo.c:554] [INFO] [MQTTDemo] ulNotification 11; ulValueToNotify 11
316 12035 [SubPub0] [simple_sub_pub_demo.c:557] [INFO] [MQTTDemo] Publish 11 completed successfully.
317 12036 [SubPub0] [simple_sub_pub_demo.c:560] [INFO] [MQTTDemo] Short delay before next publish...
318 12131 [MQTT Agent] >>> lwip_recv(): -1
I try another things too, and i notice that
prvWaitForCommandAcknowledgment()
receive pdFalse (maybe timeout) that i know prvPublishCommandCallback()
is not called, so publish considered failed. Application able to define something happen to MQTT connection by failed publish message. But still MQTTAgent_CommandLoop()
not return anything. MQTT_Ping()
manually every 60s, but when no internet, still MQTTAgent_CommandLoop
not return anything. lwip_receive()
not receive anything when sending PINGREQ packet when internet not available.My application using QOS0, that's why it always return success publish, but do i really need to use QOS1? If yes, should i state the mqtt connection to disconnected when failed publish message?
Regardless, i think something need to be fixed if people decide to use QOS0.
FYI, demo config CONFIG_CORE_MQTT_MUTUAL_AUTH_DEMO_ENABLED using QOS1 and CONFIG_CORE_MQTT_AGENT_DEMO_ENABLED using QOS0.
I can fix my problem by adding flag to sendPacket()
either packet will be consider last transmission time or not(Publish QOS0). If packet that will be sent expecting ACK, so it consider last transmission time.
in core_mqtt.c
sendPacket()
/* Update time of last transmission if the entire packet is successfully sent. */
if( totalBytesSent > 0 && updateLastPacketTime )
{
pContext->lastPacketTime = lastSendTimeMs;
LogDebug( ( "Successfully sent packet at time %lu.",
( unsigned long ) lastSendTimeMs ) );
}
sendPublish()
bool updateLastPacketTime = true;
.
.
.
if ( pPublishInfo->qos == MQTTQoS0 ) {
/*
* Because QOS0 not wait for ACKs, it will never know if it is success or not.
* So when application periodically publish data under keep alive interval time,
* when internet suddenly unavailable in runtime, application will never get error event.
*/
updateLastPacketTime = false;
}
/* Send header first. */
bytesSent = sendPacket( pContext,
pContext->networkBuffer.pBuffer,
headerSize,
updateLastPacketTime );
.
.
.
bytesSent = sendPacket( pContext,
pPublishInfo->pPayload,
pPublishInfo->payloadLength,
updateLastPacketTime );
Thank you for investigating. So the MQTT PING mechanism is able to detect the broken connection but it was not getting triggered as the connection was not determined to be idle because of QoS0 publishes.
I have a question - From the logs, you shared before -
318 12131 [MQTT Agent] >>> lwip_recv(): -1
The loop should break when LWIP returns error:
`MQTTAgent_CommandLoop --> processCommand --> MQTT_ProcessLoop --> receiveSingleIteration --> MQTT_GetIncomingPacketTypeAndLength`
Can you trace why that is not happening? Also, I am looking into your proposed solution.
Also, can you try this and let me know if it works for you - https://github.com/aggarg/coreMQTT/tree/keep_alive
No problem.
I put log in:
Logs
209 1573 [MQTT Agent] >>> lwip_recv(): -1
210 1573 [MQTT Agent] [transport_secure_sockets.c:200] [INFO] [Transport_Secure_Sockets] SOCKETS_Recv: 0
211 1573 [MQTT Agent] [core_mqtt_serializer.c:2364] [INFO] [MQTT] bytesReceived: 0
212 1574 [MQTT Agent] [core_mqtt.c:1301] [INFO] [MQTT] MQTT_GetIncomingPacketTypeAndLength: MQTTNoDataAvailable
214 1575 [MQTT Agent] [core_mqtt.c:1332] [INFO] [MQTT] receivePacket: MQTTNoDataAvailable
215 1576 [MQTT Agent] [core_mqtt.c:2218] [INFO] [MQTT] Exit from receiveSingleIteration loop. Status: MQTTSuccess
I think the problem is in here, it goes inside the statement, so the flow thinks everythings is okay.
Also, can you try this and let me know if it works for you - https://github.com/aggarg/coreMQTT/tree/keep_alive
Yes your solutions works
Related to why lwip_recv: -1
, that maybe different issue.
I had an error that a couple times occured. This error always print out forever and not print others, it's like something is stuck that make all tasks stopped from scheduling. But i can't replicate it.
I think the problem is in here, it goes inside the statement, so the flow thinks everythings is okay.
What is the value of errno? Enabling LWIP debug logs may also be helpful here.
I have no idea why but, errno value is 11
277 7503 [MQTT Agent] [core_mqtt.c:1072] [INFO] [MQTT] State record updated. New state=MQTTPublishDone.
278 7503 [MQTT Agent] [simple_sub_pub_demo.c:354] [INFO] [MQTTDemo] Received incoming publish message Publisher0 publishing message 3
279 7506 [MQTT Agent] >>> lwip_recv(): -1
280 7506 [MQTT Agent] >>> errno: 11
I put the logs as below
int ret = lwip_recv( ctx->ip_socket,
pucReceiveBuffer,
xReceiveLength,
ctx->recv_flag );
configPRINTF((">>> lwip_recv(): %d", ret));
if( -1 == ret )
{
/*
* 1. EWOULDBLOCK if the socket is NON-blocking, but there is no data
* when recv is called.
* 2. EAGAIN if the socket would block and have waited long enough but
* packet is not received.
*/
if( ( errno == EWOULDBLOCK ) || ( errno == EAGAIN ) )
{
configPRINTF((">>> errno: %d", errno));
return SOCKETS_ERROR_NONE; /* timeout or would block */
}
/*
* socket is not connected.
*/
if( errno == EBADF )
{
return SOCKETS_ECLOSED;
}
}
I mean, why it goes inside that statement while value of EAGAIN is 35 and EWOULDBLOCK is EAGAIN. Will enable LWIP debug later.
I mean, why it goes inside that statement while value of EAGAIN is 35 and EWOULDBLOCK is EAGAIN.
That is certainly strange. Can you change the log line to below to confirm:
configPRINTF((">>> errno: %d, EWOULDBLOCK: %d, EAGAIN: %d", errno, EWOULDBLOCK, EAGAIN));
I am closing this issue. Feel free to reopen or create a new one if you still face any issue.
Describe the bug Please provide a clear and concise description explaining the bug.
System information
202107.00
and202203.00
Expected behavior
MQTTAgent_CommandLoop
should return disconnected or any error event inmqtt_agent_task.c
when internet disconnected while wifi still connected, just like demo CONFIG_CORE_MQTT_MUTUAL_AUTH_DEMO_ENABLED .Screenshots or console output
Steps to reproduce bug Just run it normally. I only change these line below so demo run longer:
#define mqttexampleDELAY_BETWEEN_PUBLISH_OPERATIONS_MS ( 5000U )
#define mqttexamplePUBLISH_COUNT ( 100UL )
Additional context
202107.00
and202203.00
202107.00
and when i reproduce the bug, i see that message is not actually published, but still in the queue.Thank you!