commschamp / cc.mqttsn.libs

CommsChampion Ecosystem MQTT-SN client / gateway libraries and applications
Mozilla Public License 2.0
63 stars 16 forks source link

Ping is sometimes send twice #14

Closed trivialkettle closed 7 months ago

trivialkettle commented 7 months ago

Hi, I use the client on a MCU and notice strange disconnects to the MQTTSN gateway. The error was always that the gateway disconnected, though I see in the traffic that PINGREQ and PINGRESP are always on time. The disconnects happened often together with a valid ping in the logs:

<--- #4984
02 16                    | ..

---> #26
02 17                    | ..

I used retryPeriod = 5[s] and retryCount = 1.

I set retryCount = 3, m_keepAlivePeriod=10_000 and added some logs to the BasicClient.h and found the following:

The log format is:

internal clock in [ms], filename/function/line: some print

The "normal" operation looks like this:

I see a lot calls to updateTimestamp, and than about 30ms later tick() is called by my timer. tick() calls checkTimeouts(). It does not early exit because m_timestamp is not smaller than m_nextTimeoutTimestamp and finally calls checkPing().

bool needsToSendPing =
                ((m_lastSentMsgTimestamp + m_keepAlivePeriod) <= m_timestamp) ||
                ((m_lastRecvMsgTimestamp + m_keepAlivePeriod) <= m_timestamp);

evaluates to true because m_lastRecvMsgTimestamp (1322) + m_keepAlivePeriod(10_000) == m_timestamp(11322) and a ping is send.

15258 BasicClient.h/updateTimestamp:1948: 
15286 BasicClient.h/tick:591: 
15286 BasicClient.h/checkTimeouts:2188: m_timestamp: 11322 m_nextTimeoutTimestamp: 11322
15286 BasicClient.h/checkPing:2098: checkPing
15286 BasicClient.h/checkPing:2114: m_timestamp: 11322 m_lastSentMsgTimestamp: 11294 m_lastRecvMsgTimestamp 1322
15286 BasicClient.h/checkPing:2115: needsToSendPing 1
15286 BasicClient.h/sendPing:2640: count: 0 ts: 11322 lastping: 0

As expected this happens 10s later again:

25208 BasicClient.h/updateTimestamp:1948: 
25211 BasicClient.h/tick:591: 
25211 BasicClient.h/checkTimeouts:2188: m_timestamp: 21324 m_nextTimeoutTimestamp: 21324
25211 BasicClient.h/checkPing:2098: checkPing
25211 BasicClient.h/checkPing:2114: m_timestamp: 21324 m_lastSentMsgTimestamp: 21322 m_lastRecvMsgTimestamp 11324
25211 BasicClient.h/checkPing:2115: needsToSendPing 1
25211 BasicClient.h/sendPing:2640: count: 0 ts: 21324 lastping: 11322

But sometimes this happens: updateTimstamp calls checkTimeouts and m_timestamp == m_nextTimeoutTimestamp so it does not exit and calls checkPing(). There again is needs to ping, because m_lastRecvMsgTimestamp is 10s old. But then tick is called and m_timestamp is magically increased by 5000ms in just 4ms, so m_timestamp == m_nextTimeoutTimestamp again and the call to sendPing() in the last line of checkPing() is executed. So basically there are two pings in just 4ms, though the retry period is 5s. This lead to my gateway disconnect errors.

35123 BasicClient.h/updateTimestamp:1948: 
35123 BasicClient.h/checkTimeouts:2188: m_timestamp: 31329 m_nextTimeoutTimestamp: 31329
35123 BasicClient.h/checkPing:2098: checkPing
35123 BasicClient.h/checkPing:2114: m_timestamp: 31329 m_lastSentMsgTimestamp: 31328 m_lastRecvMsgTimestamp 21329
35123 BasicClient.h/checkPing:2115: needsToSendPing 1
35123 BasicClient.h/sendPing:2640: count: 0 m_timestamp: 31329 lastping: 21324
35125 BasicClient.h/updateTimestamp:1948: 
35127 BasicClient.h/tick:591: 
35127 BasicClient.h/checkTimeouts:2188: m_timestamp: 36329 m_nextTimeoutTimestamp: 36329
35127 BasicClient.h/checkPing:2098: checkPing
35127 BasicClient.h/checkPing:2132: lastLine
35127 BasicClient.h/sendPing:2640: count: 1 m_timestamp: 36329 lastping: 31329
35129 BasicClient.h/updateTimestamp:1948: 

10s later "normal" operation:

45077 BasicClient.h/updateTimestamp:1948: 
45088 BasicClient.h/tick:591: 
45088 BasicClient.h/checkTimeouts:2188: m_timestamp: 46374 m_nextTimeoutTimestamp: 46374
45088 BasicClient.h/checkPing:2098: checkPing
45088 BasicClient.h/checkPing:2114: m_timestamp: 46374 m_lastSentMsgTimestamp: 46363 m_nextTimeoutTimestamp 36374
45088 BasicClient.h/checkPing:2115: needsToSendPing 1
45088 BasicClient.h/sendPing:2640: count: 0 m_timestamp: 46374 lastping: 36329
45090 BasicClient.h/updateTimestamp:1948: 

then some later again double ping.

74874 BasicClient.h/updateTimestamp:1948: 
74874 BasicClient.h/checkTimeouts:2188: m_timestamp: 76383 m_nextTimeoutTimestamp: 76380
74874 BasicClient.h/checkPing:2098: checkPing
74874 BasicClient.h/checkPing:2114: m_timestamp: 76383 m_lastSentMsgTimestamp: 76345 m_lastRecvMsgTimestamp 66380
74874 BasicClient.h/checkPing:2115: needsToSendPing 1
74874 BasicClient.h/sendPing:2640: count: 0 m_timestamp: 76383 lastping: 66378
74876 BasicClient.h/updateTimestamp:1948: 
74878 BasicClient.h/tick:591: 
74878 BasicClient.h/checkTimeouts:2188: m_timestamp: 81383 m_nextTimeoutTimestamp: 81383
74878 BasicClient.h/checkPing:2098: checkPing
74878 BasicClient.h/checkPing:2132: lastLine
74878 BasicClient.h/sendPing:2640: count: 1 m_timestamp:  <missing logs>

My workaround is to set retryCount to 3.

Since I am on a MCU with pretty limited debug functionality I cannot really step into the code. I hope the logs help.

Best regards

arobenko commented 7 months ago

Hi @trivialkettle, I haven't really touched the MQTT-SN code for years and my in-depth understanding of it got a bit rusty over time. Besides, my current bandwidth is a bit limited to look into it in depth.

First of all, if I understood your description correctly you say you experience unexpected call to tick(). If my understanding is correct, please make sure when the callback set by the cc_mqttsn_client_set_cancel_next_tick_wait_callback() is called, the previously programmed timer is actually cancelled and only after the next invocation of the callback set by the cc_mqttsn_client_set_next_tick_program_callback() will be reprogrammed again with the new value.

To help me to help you to analyze the problem please provide the logs with timestamps from your application code:

It will help me see the sequence of the API calls and I might be able to reproduce the problem in the unit-testing and then be able to look into it properly.

trivialkettle commented 7 months ago

Hi @arobenko You are right, there is an unexpected call to tick(). It looks like that the timer is canceled and the cancel command is "stuck" in the task queue because of other stuff that is done so that the timer timeouts and queues tick() to the task queue.

This needs to be fixed on my side, thanks for your hint.